Generate simpler LLVM IR for shuffles that recursively become broadcasts #7902
As of a few days ago, LLVM 18 is causing the test performance_nested_vectorization_gemm to fail on ARM without arm_dot_prod. The cause seems to be that we generate very complex chains of shuffles for the following piece of IR:
shuffle({some_u8x2}, {0, 1, 0, 1, 0, 1, 0, 1});
The current CodeGen_LLVM Shuffle handler reinterprets the u8x2 as a u16, and creates the following shuffle:
shuffle({reinterpret<u16>(some_u8x2)}, {0, 0, 0, 0});
and then recursively calls the shuffle codegen visitor. The shuffle codegen visitor doesn't have a special case for broadcasts, so it sees this as a degenerate self-interleave, and produces a complex binary tree of shuffles.
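To illustrate the reinterpretation step described above, here is a minimal standalone sketch of how a list of shuffle indices over narrow lanes can be rewritten as indices over lanes that are `factor` times wider. The function name `widen_indices` and its signature are hypothetical and are not taken from CodeGen_LLVM; the sketch only assumes that each group of `factor` indices must be an aligned, consecutive run for the rewrite to be valid.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not Halide's actual code): try to express shuffle
// indices over narrow lanes as indices over lanes `factor` times wider.
// Returns std::nullopt if any group of `factor` indices is not an aligned,
// consecutive run.
std::optional<std::vector<int>> widen_indices(const std::vector<int> &indices, int factor) {
    if (factor <= 0 || indices.size() % static_cast<std::size_t>(factor) != 0) {
        return std::nullopt;
    }
    std::vector<int> wide;
    for (std::size_t i = 0; i < indices.size(); i += static_cast<std::size_t>(factor)) {
        // The group must start on a wide-lane boundary.
        if (indices[i] < 0 || indices[i] % factor != 0) {
            return std::nullopt;
        }
        // The remaining indices in the group must follow consecutively.
        for (int j = 1; j < factor; j++) {
            if (indices[i + static_cast<std::size_t>(j)] != indices[i] + j) {
                return std::nullopt;
            }
        }
        wide.push_back(indices[i] / factor);
    }
    return wide;
}
```

With the indices from the example, `widen_indices({0, 1, 0, 1, 0, 1, 0, 1}, 2)` yields the four identical wide indices `{0, 0, 0, 0}`, which is the shape the recursive call then has to handle.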
This PR instead detects broadcasts of a single lane of a single vector, and uses the existing broadcast handling.
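The detection itself can be sketched roughly as follows. This is an illustrative standalone check, not the code in this PR: the name `single_lane_broadcast` is hypothetical, and it assumes the shuffle has a single input vector, so that only the index list needs inspecting.

```cpp
#include <optional>
#include <vector>

// Hypothetical helper (not the PR's actual code): if every shuffle index
// selects the same lane of a single input vector, return that lane;
// otherwise return std::nullopt.
std::optional<int> single_lane_broadcast(const std::vector<int> &indices) {
    if (indices.empty()) {
        return std::nullopt;
    }
    for (int idx : indices) {
        if (idx != indices[0]) {
            return std::nullopt;
        }
    }
    return indices[0];
}
```

When the check succeeds, the caller could hand the selected lane to a broadcast path instead of building a tree of shuffles; when it fails, the general shuffle handling would proceed as before.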