Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More register used when multiple target regions are compiled together #24

Closed
ye-luo opened this issue Aug 3, 2019 · 4 comments
Closed
Assignees

Comments

@ye-luo
Copy link

ye-luo commented Aug 3, 2019

The source code I'm using has multiple offload regions in different member functions of a class.
If I enable individual target region and comment the other target pragma
Kernel 1 only

      NumSGPRs:        90
      NumVGPRs:        256
      NumSpilledVGPRs: 158

kernel 2 only

      NumSGPRs:        86
      NumVGPRs:        164

If I enabled both offload regions.
kernel 1

      NumSGPRs:        90
      NumVGPRs:        256
      NumSpilledVGPRs: 160

kernal 2

      NumSGPRs:        86
      NumVGPRs:        256
      NumSpilledVGPRs: 160

The amount of needed vector register + spill is more than individually ones.
Both kernels are compiled from independent target regions. This behaviour seems very strange.

@gregrodgers
Copy link
Contributor

This is good information. I would like to know what optimization level if any you requested. Can you attach your source and command line? Thank you.

@ye-luo
Copy link
Author

ye-luo commented Aug 6, 2019

reproducer

git clone https://github.com/ye-luo/miniqmc
cd miniqmc/build
cmake -DCMAKE_CXX_COMPILER=/home/yeluo/rocm/aomp_0.7-0/bin/clang++ \
-DENABLE_OFFLOAD=1 -DOFFLOAD_TARGET=amdgcn-amd-amdhsa \
-DCMAKE_CXX_FLAGS="-Xopenmp-target=amdgcn-amd-amdhsa -march=gfx906 -v" \
..
make -j15 check_spo_batched

src/QMCWaveFunctions/einspline_spo_omp.cpp
line 159, 238, 311, 405 have offload regions for heavy computation.
The kernel at 311 has

    CodeProps:
      KernargSegmentSize: 72
      GroupSegmentFixedSize: 1024
      PrivateSegmentFixedSize: 872
      KernargSegmentAlign: 8
      WavefrontSize:   64
      NumSGPRs:        92
      NumVGPRs:        256
      MaxFlatWorkGroupSize: 256
      NumSpilledVGPRs: 375

Now just comment the #pragma omp 149, 238, 405 but leave 311.

make -j15 check_spo_batched
    CodeProps:
      KernargSegmentSize: 72
      GroupSegmentFixedSize: 766
      PrivateSegmentFixedSize: 48
      KernargSegmentAlign: 8
      WavefrontSize:   64
      NumSGPRs:        88
      NumVGPRs:        250
      MaxFlatWorkGroupSize: 256

The NumVGPRs reduces and there is no spill.

Another test, if I add right before line 311

#pragma omp target
{ }

The newly added kernel has

    CodeProps:
      KernargSegmentSize: 0
      GroupSegmentFixedSize: 754
      PrivateSegmentFixedSize: 0
      KernargSegmentAlign: 4
      WavefrontSize:   64
      NumSGPRs:        40
      NumVGPRs:        248
      MaxFlatWorkGroupSize: 256

All the numbers are significantly larger than the numbers given when the empty offload region is compiled standalone.

@JonChesterfield
Copy link
Contributor

JonChesterfield commented Aug 14, 2020

-v doesn't produce this output anymore. A potentially useful alternative is -mllvm -amdgpu-dump-hsa-metadata
, which produces yaml output like:

AMDGPU HSA Metadata:
---
amdhsa.kernels:
  - .args:
      - .address_space:  generic
        .name:           isHost
        .offset:         0
        .size:           8
        .value_kind:     global_buffer
    .group_segment_fixed_size: 915
    .kernarg_segment_align: 8
    .kernarg_segment_size: 8
    .language:       OpenCL C
    .language_version:
      - 2
      - 0
    .max_flat_workgroup_size: 256
    .name:           __omp_offloading_fd00_261b86_kernel_l7
    .private_segment_fixed_size: 0
    .sgpr_count:     25
    .sgpr_spill_count: 0
    .symbol:         __omp_offloading_fd00_261b86_kernel_l7.kd
    .vgpr_count:     22
    .vgpr_spill_count: 0
    .wavefront_size: 64

Alternatively, that is available by reading the msgpack data from the shared library (elf) containing device code.

@ronlieb
Copy link
Contributor

ronlieb commented Nov 21, 2023

Hi Ye, this one is over 3 years old, closing. if still an issue please reopen,, or open new issue.

@ronlieb ronlieb closed this as completed Nov 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants