Support fusing Transpose + MatMul where both inputs are transposed #398

Open
robertknight opened this issue Oct 29, 2024 · 0 comments
Labels
performance: Issues that affect model inference or loading performance

Comments

@robertknight
Owner

The decoder of the recently added Whisper example spends a large fraction of its time in Transpose operations, especially for larger models. Many of these come from subgraphs that look like:

[Input A] -> Transpose -> MatMul ---> [Output]
[Input B] -> Transpose ------^

Here, [Input A] is computed by earlier parts of the graph and [Input B] is a KV-cache tensor from a previous run. When using the "merged" decoder, this subgraph sits inside an If subgraph, so [Input B] is a captured view.

The graph optimizer currently supports fusing Transpose + MatMul when only one of the inputs is transposed, but not when both are.
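
As a sketch of why the fusion helps (this is illustrative only, not rten's actual API; all names below are hypothetical): a MatMul that accepts a per-input "transposed" flag can read each operand with swapped indices inside the inner loop, so neither transposed copy of [Input A] nor [Input B] ever needs to be materialized.

```rust
// Minimal sketch of a matmul with per-input transpose flags, so
// `Transpose(A) x Transpose(B)` can be computed by swapping the index
// order when reading each input, instead of materializing transposed
// copies of A and B first.

/// Read logical element (i, j) of a row-major matrix with `cols`
/// columns, optionally treating the matrix as transposed.
fn get(data: &[f32], cols: usize, i: usize, j: usize, transposed: bool) -> f32 {
    if transposed {
        // Logical (i, j) of Aᵀ is physical (j, i) of A.
        data[j * cols + i]
    } else {
        data[i * cols + j]
    }
}

/// C = op(A) x op(B), where op(X) is X or Xᵀ depending on the flag.
/// `m`, `k`, `n` are the dimensions *after* applying the transposes.
fn matmul_fused(
    a: &[f32], a_cols: usize, a_t: bool,
    b: &[f32], b_cols: usize, b_t: bool,
    m: usize, k: usize, n: usize,
) -> Vec<f32> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for p in 0..k {
                acc += get(a, a_cols, i, p, a_t) * get(b, b_cols, p, j, b_t);
            }
            c[i * n + j] = acc;
        }
    }
    c
}

fn main() {
    // A is 3x2 and B is 4x3 in memory; Aᵀ (2x3) x Bᵀ (3x4) is a 2x4 result.
    let a: Vec<f32> = (0..6).map(|v| v as f32).collect();
    let b: Vec<f32> = (0..12).map(|v| v as f32).collect();
    let c = matmul_fused(&a, 2, true, &b, 3, true, 2, 3, 4);
    println!("{c:?}");
}
```

The same idea extends to batched matmuls: a transpose amounts to swapping the strides used to address an input, which a fused kernel can absorb at essentially zero cost.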

robertknight added the performance label on Oct 29, 2024