[compiler] Do not mix kernels with different sub-group sizes. #649
Overview
[compiler] Do not mix kernels with different sub-group sizes.
Reason for change
Per section 3.2.1 of the OpenCL 3.0 API specification, "Mapping Work-items Onto an NDRange", all sub-groups within a work-group will be the same size, apart from the sub-group with the maximum index, which may be smaller if the size of the work-group is not evenly divisible by the size of the sub-groups. We were not meeting this requirement: in cases where we would not or could not generate a predicated vectorized kernel, we would execute the scalar kernel in a loop for any remaining work-items, possibly resulting in multiple sub-groups that are smaller than the maximum sub-group size.
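To make the problem concrete, here is a minimal sketch of the sub-group sizes that result in each case. It is not code from this PR: the function names are hypothetical, and it assumes the scalar kernel runs with a sub-group size of 1, which the description above does not state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sub-group sizes that satisfy the rule: every sub-group has the same size,
// except possibly the last one, which takes the remainder.
std::vector<std::size_t> conformingSubGroupSizes(std::size_t workGroupSize,
                                                 std::size_t subGroupSize) {
  assert(subGroupSize > 0);
  std::vector<std::size_t> sizes(workGroupSize / subGroupSize, subGroupSize);
  if (std::size_t rem = workGroupSize % subGroupSize) sizes.push_back(rem);
  return sizes;
}

// Model of the old behaviour: full vector sub-groups, then the scalar kernel
// looped over the remaining work-items. ASSUMPTION: each scalar iteration
// forms its own sub-group of size 1.
std::vector<std::size_t> vectorPlusScalarTailSizes(std::size_t workGroupSize,
                                                   std::size_t vectorWidth) {
  assert(vectorWidth > 0);
  std::vector<std::size_t> sizes(workGroupSize / vectorWidth, vectorWidth);
  sizes.insert(sizes.end(), workGroupSize % vectorWidth, 1);
  return sizes;
}

// True if only the last sub-group is allowed to be smaller than the first.
bool isConforming(const std::vector<std::size_t> &sizes) {
  for (std::size_t i = 0; i < sizes.size(); ++i) {
    const bool isLast = (i + 1 == sizes.size());
    if (isLast ? sizes[i] > sizes[0] : sizes[i] != sizes[0]) return false;
  }
  return true;
}
```

Under these assumptions, a work-group of 19 work-items with a vector width of 8 should be split as 8, 8, 3, whereas the vector-plus-scalar-tail model gives 8, 8, 1, 1, 1, which has three undersized sub-groups instead of at most one.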
Description of change
To avoid this situation, we need to avoid mixing vector and scalar kernels if those kernels use different sub-group sizes. If we can handle all work-items with vector kernels, possibly with predication, we continue to do so. If the vector and scalar kernels do not depend on the sub-group size, we also continue to handle this as before. If the vector and scalar kernels do depend on the sub-group size, and the vector kernel cannot handle all work-items, we need to switch to the scalar kernel for all work-items.
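The three cases above can be sketched as a small decision function. This is an illustration only, not the code in this PR: `Strategy`, `KernelPair`, its fields, and `chooseStrategy` are hypothetical names introduced here.

```cpp
#include <cassert>

// Hypothetical summary of how a work-group's items get executed.
enum class Strategy { VectorOnly, VectorWithScalarTail, ScalarOnly };

// Hypothetical description of a vector/scalar kernel pair.
struct KernelPair {
  bool vectorCoversAllItems;   // vector kernel can handle every work-item,
                               // possibly with predication
  bool dependsOnSubGroupSize;  // the kernels observe the sub-group size
  unsigned vectorSubGroupSize;
  unsigned scalarSubGroupSize;
};

Strategy chooseStrategy(const KernelPair &k) {
  // Case 1: the vector kernel alone covers the work-group.
  if (k.vectorCoversAllItems) return Strategy::VectorOnly;
  // Case 2: nothing observes the sub-group size, so mixing is harmless.
  if (!k.dependsOnSubGroupSize) return Strategy::VectorWithScalarTail;
  // Mixing is also fine when both kernels use the same sub-group size.
  if (k.vectorSubGroupSize == k.scalarSubGroupSize)
    return Strategy::VectorWithScalarTail;
  // Case 3: sizes differ and the vector kernel cannot cover everything,
  // so run the scalar kernel for all work-items.
  return Strategy::ScalarOnly;
}
```

For example, `chooseStrategy({false, true, 8, 1})` returns `Strategy::ScalarOnly`, while the same pair with `dependsOnSubGroupSize` set to `false` keeps the vector kernel with a scalar tail.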
Anything else we should know?
This includes a small optimization where if we know the kernel does not use sub-group information, we avoid setting sub-group IDs.
This includes one change to createLoop which permits nullptr PHIs. They will be skipped over, and are useful since PHIs must be referred to by index in the callback function. This allows indices to be constant even when the caller has multiple optional PHIs.
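The benefit of permitting null entries can be shown with a toy model. The sketch below does not use the real `createLoop` signature or LLVM types: `ToyPhi`, `toyCreateLoop`, and `sumWithOptionalCount` are hypothetical stand-ins that only mirror the idea that null slots are skipped while the remaining slots keep their indices.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Stand-in for an IR PHI node: a value carried around the loop.
struct ToyPhi {
  long value;
};

using LoopBody =
    std::function<void(std::size_t, const std::vector<ToyPhi *> &)>;

// Runs `body` once per iteration. Null entries in `phis` are skipped when
// counting live loop-carried values, but keep their slot so that indices
// seen by `body` do not shift. Returns the number of live PHIs.
std::size_t toyCreateLoop(std::size_t tripCount,
                          const std::vector<ToyPhi *> &phis,
                          const LoopBody &body) {
  std::size_t live = 0;
  for (ToyPhi *phi : phis) {
    if (!phi) continue;  // nullptr PHIs are skipped over
    ++live;
  }
  for (std::size_t i = 0; i < tripCount; ++i) body(i, phis);
  return live;
}

// A caller with one optional PHI (count) and one mandatory PHI (sum).
// The sum is always at index 1, whether or not the count is tracked.
long sumWithOptionalCount(std::size_t n, bool trackCount) {
  ToyPhi count{0};
  ToyPhi sum{0};
  const std::size_t kCount = 0;
  const std::size_t kSum = 1;
  std::vector<ToyPhi *> phis = {trackCount ? &count : nullptr, &sum};
  toyCreateLoop(n, phis,
                [&](std::size_t i, const std::vector<ToyPhi *> &p) {
                  if (p[kCount]) p[kCount]->value += 1;
                  p[kSum]->value += static_cast<long>(i);
                });
  return sum.value;
}
```

Without null slots, the caller would have to compute the index of `sum` from which optional PHIs happen to be present, which becomes error-prone once there are several of them.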
This also includes one bugfix to ControlFlowConversionPass for a crash that now shows up: we were using the result of createMasked{Load,Store} before checking whether it succeeded.
This also includes one improvement to CompileKernelToBin.cmake. If the executed command fails, it will now be printed in a format that can be copied and pasted.
Checklist