[compiler] Do not mix kernels with different sub-group sizes. #649

Open
wants to merge 1 commit into
base: main

hvdijk
Collaborator

@hvdijk hvdijk commented Jan 18, 2025

Overview

[compiler] Do not mix kernels with different sub-group sizes.

Reason for change

Per section 3.2.1 of the OpenCL 3.0 API specification, "Mapping work-items onto an NDRange", all sub-groups within a work-group will be the same size, apart from the sub-group with the maximum index, which may be smaller if the work-group size is not evenly divisible by the sub-group size. We were not meeting this requirement: when we would not or could not generate a predicated vectorized kernel, we executed the scalar kernel in a loop for any remaining work-items, which could produce multiple sub-groups smaller than the maximum sub-group size.
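To make the violation concrete, here is a minimal sketch that models the sub-group sizes a work-group ends up with when a vector kernel is followed by a scalar tail loop. It is not code from this PR: the function names are hypothetical, and it assumes that each work-item handled by the scalar kernel forms its own sub-group of size 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model: sub-group sizes seen when `workGroupSize` work-items are
// covered by a vector kernel with sub-group size `vectorWidth`, followed by a
// scalar tail loop. Assumption: every scalar work-item is a sub-group of size 1.
std::vector<std::size_t> mixedSubGroupSizes(std::size_t workGroupSize,
                                            std::size_t vectorWidth) {
  assert(vectorWidth > 0);
  std::vector<std::size_t> sizes;
  // Full-width sub-groups produced by the vector kernel.
  sizes.resize(workGroupSize / vectorWidth, vectorWidth);
  // One size-1 sub-group per remaining work-item in the scalar tail.
  sizes.resize(sizes.size() + workGroupSize % vectorWidth, 1);
  return sizes;
}

// True if all sub-groups except possibly the last share one size, and the
// last sub-group is non-empty and no larger than that size.
bool isUniformExceptLast(const std::vector<std::size_t> &sizes) {
  if (sizes.empty()) return true;
  const std::size_t maxSize = sizes.front();
  for (std::size_t i = 0; i + 1 < sizes.size(); ++i)
    if (sizes[i] != maxSize) return false;
  return sizes.back() > 0 && sizes.back() <= maxSize;
}
```

Under these assumptions, a work-group of 19 items with a vector width of 8 yields sub-groups of sizes 8, 8, 1, 1, 1, which fails the check, whereas 17 items yields 8, 8, 1, which passes because only the last sub-group is smaller.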

Description of change

To avoid this situation, we must not mix vector and scalar kernels if those kernels use different sub-group sizes. If we can handle all work items with vector kernels, possibly with predication, we continue to do so. If the vector and scalar kernels do not depend on the sub-group size, we also continue to handle this as before. If the vector and scalar kernels do depend on the sub-group size, and the vector kernel cannot handle all work items, we switch to the scalar kernel for all work items.
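The three cases above can be summarized as a small decision function. This is only a sketch of the rule as described here, not the pass's actual code: `KernelInfo`, `Strategy`, and `chooseStrategy` are hypothetical names, and the sketch says nothing about whether the real decision is made at compile time or at run time.

```cpp
// Hypothetical summary of the facts the decision depends on.
struct KernelInfo {
  // The vector kernel, possibly predicated, can handle every work-item.
  bool vectorCoversAllItems;
  // The vector and scalar kernels depend on the sub-group size.
  bool dependsOnSubGroupSize;
};

enum class Strategy {
  VectorOnly,            // vector kernel handles everything
  VectorThenScalarTail,  // previous behaviour: scalar loop for the remainder
  ScalarOnly             // scalar kernel for all work-items
};

Strategy chooseStrategy(const KernelInfo &info) {
  // Case 1: no scalar tail is needed, so nothing is mixed.
  if (info.vectorCoversAllItems) return Strategy::VectorOnly;
  // Case 2: mixing cannot be observed by the kernel, so keep the old scheme.
  if (!info.dependsOnSubGroupSize) return Strategy::VectorThenScalarTail;
  // Case 3: mixing would expose inconsistent sub-group sizes; go fully scalar.
  return Strategy::ScalarOnly;
}
```

Only the third case gives up vectorization, and only when the kernel could actually observe inconsistent sub-group sizes.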

Anything else we should know?

This includes a small optimization where if we know the kernel does not use sub-group information, we avoid setting sub-group IDs.

This includes one change to createLoop, which now permits nullptr PHIs. They are skipped over, and are useful because PHIs must be referred to by index in the callback function: allowing nullptr entries keeps the indices constant even when the caller has multiple optional PHIs.
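The benefit of constant indices can be illustrated without LLVM. The sketch below is a plain C++ analogy, not the real createLoop signature: `runLoop`, `Slot`, and `simulate` are hypothetical, and raw `int *` pointers stand in for PHI nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Fixed slot indices for the loop-carried values (hypothetical).
enum Slot : std::size_t { SubGroupId = 0, ItemCount = 1, NumSlots = 2 };

// Stand-in for a loop builder: calls `body` once per iteration and passes the
// carried values through unchanged, including any nullptr entries.
void runLoop(std::size_t tripCount, const std::vector<int *> &carried,
             const std::function<void(const std::vector<int *> &)> &body) {
  assert(carried.size() == NumSlots);
  for (std::size_t i = 0; i < tripCount; ++i) body(carried);
}

struct LoopResult {
  int subGroupId;
  int itemCount;
};

// The caller leaves the SubGroupId slot as nullptr when it is not needed.
// ItemCount stays at index 1 either way, so the callback needs no remapping.
LoopResult simulate(std::size_t tripCount, bool trackSubGroupId) {
  int subGroupId = 0;
  int itemCount = 0;
  std::vector<int *> carried(NumSlots);  // value-initialized to nullptr
  if (trackSubGroupId) carried[SubGroupId] = &subGroupId;
  carried[ItemCount] = &itemCount;
  runLoop(tripCount, carried, [](const std::vector<int *> &vals) {
    if (vals[SubGroupId]) ++*vals[SubGroupId];  // optional slot may be absent
    ++*vals[ItemCount];
  });
  return {subGroupId, itemCount};
}
```

If absent entries had to be omitted from the list instead, `ItemCount` would sit at index 0 or 1 depending on the caller's configuration, and every callback would have to compute its own indices.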

This also includes one bugfix to ControlFlowConversionPass for a crash that is now seen with this change: we used the result of createMasked{Load,Store} before checking whether it succeeded.

This also includes one improvement to CompileKernelToBin.cmake. If the executed command fails, it will now be printed in a format that can be copied and pasted.

Checklist

  • Read and follow the project Code of Conduct.
  • Make sure the project builds successfully with your changes.
  • Run relevant testing locally to avoid regressions.
  • Run clang-format-19 on all modified code.
