Unit tests do not work on AMD GPUs #48

Closed
mirenradia opened this issue Feb 27, 2024 · 1 comment · Fixed by #49

mirenradia commented Feb 27, 2024

Summary

Like #46 for Intel GPUs, there seem to be problems with our unit tests on AMD GPUs, at least with the version of ROCm (5.1.0) and the specific GPU that I tried (an MI210, I think).

Steps to reproduce

Here are some steps to reproduce on the AMD GPU nodes on COSMA.

  1. SSH to COSMA8
  2. Clone AMReX:
    git clone https://github.com/AMReX-Codes/amrex.git
  3. Clone this repo:
    git clone https://github.com/GRTLCollaboration/GRTeclyn.git
  4. SSH to ga005 (which is running ROCm v5.1.0 at the time of writing)
  5. ROCm 5.1.0 is already on the system path, so there is no need to load any modules for it. However, it is necessary to load a newer version of GCC and tell hipcc to use it. First, load the module:
    module load gnu_comp/11.1.0
  6. Add the following lines to ~/amrex/Tools/GNUMake/Make.local (create this file if it doesn't exist):
    ifeq ($(USE_HIP),TRUE)
      GCC_PATH := $(shell realpath -m $(shell which gcc)/../..)
      GCC_VERSION := $(notdir $(GCC_PATH))
      CXXFLAGS += --gcc-toolchain=$(GCC_PATH)
      SYSTEM_INCLUDE_LOCATIONS += $(GCC_PATH)/include/c++/$(GCC_VERSION)
      AMREX_AMD_ARCH = gfx90a
    endif
  7. Change into the Tests directory:
    cd ~/GRTeclyn/Tests
  8. Build with USE_HIP=TRUE:
    make -j 128 USE_HIP=TRUE
  9. Run the tests:
    make run USE_HIP=TRUE

Observed outcome

The tests abort with the following error:

:0:rocdevice.cpp            :2614: 1823894239190 us: 124592: [tid:0x2b2d6af70700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
SIGABRT
See Backtrace.0 file for details

Additional information

Passing -DCATCH_CONFIG_NO_COUNTER to the preprocessor allows the tests to run (although the CCZ4 RHS test currently fails because the tolerances are too tight). This option changes how Catch2 internally generates unique names for test cases: it uses the predefined __LINE__ macro instead of __COUNTER__.

The test cases are currently in different translation units (i.e. different .cpp files), so __COUNTER__ starts at 0 in each of them. As a result, the "unique" names are not unique across files, and the ROCm linker has difficulty working out which device code to link to each kernel in the test cases. Since TEST_CASE is currently on a different line for each test (although we can't guarantee that going forward), __LINE__ differs between them and the test cases are uniquely named.
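
To illustrate the naming behaviour described above, here is a minimal sketch of counter-based versus line-based name generation. The SKETCH_* macros and the variable names are hypothetical stand-ins written for this example; they are not Catch2's actual macros. The sketch also assumes a compiler that provides __COUNTER__ (e.g. GCC or Clang).

```cpp
#include <string>

// Hypothetical helper macros for illustration only; these are not Catch2's own.
#define SKETCH_CAT_IMPL(a, b) a##b
#define SKETCH_CAT(a, b) SKETCH_CAT_IMPL(a, b)
#define SKETCH_STR_IMPL(x) #x
#define SKETCH_STR(x) SKETCH_STR_IMPL(x)

// Name built from __COUNTER__: the suffix only depends on how many times
// __COUNTER__ has already been expanded in this translation unit, so every
// .cpp file that uses the macro in the same way gets the same sequence of names.
#define SKETCH_COUNTER_NAME(prefix) SKETCH_STR(SKETCH_CAT(prefix, __COUNTER__))

// Name built from __LINE__: the suffix is the source line of the macro use,
// so names differ between files as long as the uses sit on different lines.
#define SKETCH_LINE_NAME(prefix) SKETCH_STR(SKETCH_CAT(prefix, __LINE__))

// Value of the counter just before the names below are generated.
constexpr int counter_base = __COUNTER__;

// Two counter-based names: unique within this file, but a second file with
// the same code would produce exactly the same two strings.
const std::string counter_name_a = SKETCH_COUNTER_NAME(test_);
const std::string counter_name_b = SKETCH_COUNTER_NAME(test_);

// A line-based name; line_of_a records the line on which the macro is used.
const int line_of_a = __LINE__ + 1;
const std::string line_name_a = SKETCH_LINE_NAME(test_);
```

The counter-based names are distinct within one file, which is all the preprocessor can guarantee. Nothing in them identifies the file, so two translation units can end up with identical generated names. The line-based names only avoid that collision while the macro uses happen to fall on different lines.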

@mirenradia mirenradia added the bug Something isn't working label Feb 27, 2024
@mirenradia mirenradia self-assigned this Feb 27, 2024
@mirenradia

The tests work on b340cd1 but the CCZ4 RHS one fails as the tolerances are too tight.
