The decoder of the recently added Whisper example spends a large fraction of its time in Transpose operations, especially for larger models. Many of these come from subgraphs that look like:
Where [Input A] is computed by earlier parts of the graph and [Input B] is a KV-cache tensor from a previous run. This subgraph is inside an If subgraph when using the "merged" decoder, so [Input B] is a captured view.
The graph optimizer currently supports fusing transpose + matmul when only one of the inputs is transposed, but not when both are.
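One way to handle the both-transposed case is the identity (B · A)ᵀ = Aᵀ · Bᵀ: the optimizer could swap the matmul operands and emit a single transpose on the output (or mark the output as transposed) instead of transposing both inputs. A minimal NumPy sketch of the identity, with illustrative shapes (this is not the project's actual optimizer code):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))  # stands in for [Input A] before its transpose
b = rng.standard_normal((5, 4))  # stands in for [Input B] (the KV-cache tensor)

# Unfused form: two explicit Transpose ops feeding the MatMul.
unfused = a.T @ b.T

# Fused form: swap the operands, run one MatMul, and transpose the
# result once, using the identity (B @ A).T == A.T @ B.T.
fused = (b @ a).T

assert np.allclose(unfused, fused)
```

In practice a runtime would avoid even the final transpose by having the matmul kernel accept transpose flags for its inputs or output, but the rewrite above shows why fusing both transposed inputs is algebraically straightforward.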