Support fusing Transpose + MatMul where both inputs are transposed #398

Open
robertknight opened this issue Oct 29, 2024 · 0 comments
Labels
performance: Issues that affect model inference or loading performance

Comments

@robertknight
Owner

The decoder of the recently added Whisper example spends a large fraction of its time in Transpose operations, especially for larger models. Many of these come from subgraphs that look like:

[Input A] -> Transpose -> MatMul ---> [Output]
[Input B] -> Transpose ------^

Here, [Input A] is computed by earlier parts of the graph and [Input B] is a KV-cache tensor from a previous run. When using the "merged" decoder, this subgraph sits inside an If subgraph, so [Input B] is a captured view.

The graph optimizer currently supports fusing Transpose + MatMul when only one of the inputs is transposed, but not when both are.
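
As a sketch of why the fusion helps (this is illustrative only, not rten's actual API; all names below are hypothetical): a MatMul that accepts a per-input "transposed" flag can read each operand with swapped indices inside the inner loop, so neither transposed copy of [Input A] nor [Input B] ever needs to be materialized.

```rust
// Minimal sketch of a matmul with per-input transpose flags, so
// `Transpose(A) x Transpose(B)` can be computed by swapping the index
// order when reading each input, instead of materializing transposed
// copies of A and B first.

/// Read logical element (i, j) of a row-major matrix with `cols`
/// columns, optionally treating the matrix as transposed.
fn get(data: &[f32], cols: usize, i: usize, j: usize, transposed: bool) -> f32 {
    if transposed {
        // Logical (i, j) of Aᵀ is physical (j, i) of A.
        data[j * cols + i]
    } else {
        data[i * cols + j]
    }
}

/// C = op(A) x op(B), where op(X) is X or Xᵀ depending on the flag.
/// `m`, `k`, `n` are the dimensions *after* applying the transposes.
fn matmul_fused(
    a: &[f32], a_cols: usize, a_t: bool,
    b: &[f32], b_cols: usize, b_t: bool,
    m: usize, k: usize, n: usize,
) -> Vec<f32> {
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0;
            for p in 0..k {
                acc += get(a, a_cols, i, p, a_t) * get(b, b_cols, p, j, b_t);
            }
            c[i * n + j] = acc;
        }
    }
    c
}

fn main() {
    // A is 3x2 and B is 4x3 in memory; Aᵀ (2x3) x Bᵀ (3x4) is a 2x4 result.
    let a: Vec<f32> = (0..6).map(|v| v as f32).collect();
    let b: Vec<f32> = (0..12).map(|v| v as f32).collect();
    let c = matmul_fused(&a, 2, true, &b, 3, true, 2, 3, 4);
    println!("{c:?}");
}
```

The same idea extends to batched matmuls: a transpose amounts to swapping the strides used to address an input, which a fused kernel can absorb at essentially zero cost.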

robertknight added the performance label on Oct 29, 2024