[OpenMP] amdgpu bad choice of max_flat_workgroup_size #64816
Comments
1024 is the conservative default that is always executable. CUDA, unlike OpenCL, doesn't require checking the supported workgroup size before launching a kernel, so for compatibility the backend has to support the maximum workgroup size by default. The clang-chosen default for OpenCL is a maximum of 256. If OpenMP only launches kernels under controlled conditions, the default could be decreased.
@llvm/issue-subscribers-openmp
@llvm/issue-subscribers-backend-amdgpu
The problem with OpenMP is that it sets a maximum of 1024 and allows environment variables to change the number of threads. Meaning, even if the number of threads is stated statically in the program, we need to assume it can change at runtime.
Which environment variables were you referring to?
Given the kernel is compiled with max_flat_workgroup_size, environment variables can only reduce, not increase, the workgroup size at runtime, right? I saw MaxFlatWorkgroupSize in the plugin but it doesn't seem to be used. On AMD GPUs, I noticed 256 as the default workgroup size at runtime (shown by rocprof).
A related question was asked in AOMP: ROCm/aomp#614
It's just in general there are
OMP_NUM_THREADS never changes GPU threads. Every implementation should respect that. I guess you meant OMP_THREAD_LIMIT.
There are (supposed to be) env vars for devices too. That said, in your case it's somewhat easy since you already provide a constant thread_limit. We'll look into this. |
I don't always set |
I understand, but that requires more machinery or at least verification that we won't go over the 256 limit at runtime. I think we need to record assumptions such that the runtime can pick them up and guarantee them. The constant case is easier though. |
I also checked adding |
Only partially fixed with df8d33f
No, it is not. This is a user error. The attribute states it is the user's responsibility. We could/should warn/error, but still, it is a user bug.
Right now the runtime ignores |
…ounts If we know the thread count statically and it is a constant, we can set the "amdgpu-flat-work-group-size" kernel attribute. Fixes llvm#64816 in parts.
Currently clang always sets max_flat_workgroup_size to 1024, which causes register spills. I tested overriding the default using ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256)))) and got a 2x kernel speed-up. The default of 1024 is clearly very bad in this case. When I code CUDA, even though 1024 is supported, I rarely use 1024, mostly 128 or 256. If the kernel has a thread_limit(192) clause, can the compiler take advantage of it?