[OpenMP] amdgpu bad choice of max_flat_workgroup_size #64816
Comments
1024 is the conservative default that is always executable. CUDA, unlike OpenCL, doesn't require checking the supported workgroup size before launching a kernel, so for compatibility the backend has to support the maximum workgroup size by default. The clang-chosen default for OpenCL is a maximum of 256. If OpenMP only launches kernels under controlled conditions, the default could be decreased.
@llvm/issue-subscribers-openmp
@llvm/issue-subscribers-backend-amdgpu
The problem with OpenMP is that it sets a maximum of 1024 and allows environment variables to change the number of threads. Meaning, even if the number of threads is stated statically in the program, we need to assume it can change at runtime.
Which environment variables were you referring to?
Given the kernel is compiled with max_flat_workgroup_size, environment variables can only reduce, not increase, the workgroup size at runtime, right? I saw MaxFlatWorkgroupSize in the plugin but it doesn't seem to be used. On AMD GPUs, I noticed 256 as the default workgroup size at runtime (shown by rocprof).
A related question was asked in AOMP: ROCm/aomp#614
It's just in general there are
OMP_NUM_THREADS never changes GPU threads. Every implementation should respect that. I guess you meant OMP_THREAD_LIMIT.
There are (supposed to be) env vars for devices too. That said, in your case it's somewhat easy since you already provide a constant thread_limit. We'll look into this. |
I don't always set |
I understand, but that requires more machinery or at least verification that we won't go over the 256 limit at runtime. I think we need to record assumptions such that the runtime can pick them up and guarantee them. The constant case is easier though. |
I also checked adding |
Only partially fixed with df8d33f
No, it is not. This is a user error. The attribute states it is the user's responsibility. We could/should warn/error, but still, it is a user bug.
Right now the runtime ignores |
…ounts If we know the thread count statically and it is a constant, we can set the "amdgpu-flat-work-group-size" kernel attribute. Fixes llvm#64816 in parts.
Currently clang always sets max_flat_workgroup_size to 1024, which causes register spills. I tested overriding the default using ompx_attribute(__attribute__((amdgpu_flat_work_group_size(128, 256)))) and got a 2x kernel speed-up. The default of 1024 is clearly very bad in this case. When I code CUDA, even though 1024 is supported, I rarely use 1024, mostly 128 or 256. If the kernel has a thread_limit(192) clause, can the compiler take advantage of it?