Summary
Like #46 for Intel GPUs, there seem to be problems with our unit tests on AMD GPUs, at least with the version of ROCm (5.1.0) and the specific GPU that I tried (I think an MI210).
Steps to reproduce
Here are some steps to reproduce on the AMD GPU nodes on COSMA.
SSH to ga005 (which is running ROCm v5.1.0 at the time of writing).
ROCm 5.1.0 is already in the system path, so there is no need to load any modules for it. It is, however, necessary to load a newer version of GCC and tell hipcc to use it. First load the module:
module load gnu_comp/11.1.0
Add the following lines to ~/amrex/Tools/GNUMake/Make.local (create this file if it doesn't exist):
Build the tests with USE_HIP=TRUE.
Observed outcome
The tests abort with the following error:
:0:rocdevice.cpp :2614: 1823894239190 us: 124592: [tid:0x2b2d6af70700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
SIGABRT
See Backtrace.0 file for details
Additional information
Passing -DCATCH_CONFIG_NO_COUNTER to the preprocessor allows the tests to work (although the CCZ4 RHS test currently fails as the tolerances are currently too small). This flag changes the way Catch2 internally generates unique names for test cases: it uses the __LINE__ predefined macro instead of __COUNTER__.
The test cases are currently in different translation units (i.e. different .cpp files), so __COUNTER__ ends up being 0 for each of them. The "unique" names are therefore not unique, and the ROCm linker has difficulty figuring out which device code to link to each kernel in the test cases. Since TEST_CASE is currently on a different line for each test (although we can't guarantee that going forward), __LINE__ ends up being different and the test cases are uniquely named.
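To illustrate the naming behaviour described above, here is a minimal sketch of counter-based versus line-based name generation. The SKETCH_* macros and the function names are hypothetical stand-ins written for this example, not Catch2's real internal macros. Note that __COUNTER__ is a compiler extension (supported by GCC, Clang and MSVC) rather than standard C++.

```cpp
#include <string>
#include <utility>

// Hypothetical helper macros for this sketch (not Catch2's real ones).
// The two-level indirection forces __COUNTER__ / __LINE__ to be expanded
// before they are pasted or stringified.
#define SKETCH_CAT2(a, b) a##b
#define SKETCH_CAT(a, b) SKETCH_CAT2(a, b)
#define SKETCH_STR2(x) #x
#define SKETCH_STR(x) SKETCH_STR2(x)

// Counter-based naming: each expansion takes the next value of __COUNTER__,
// which restarts at 0 in every translation unit.
#define SKETCH_COUNTER_NAME() SKETCH_STR(SKETCH_CAT(test_, __COUNTER__))

// Line-based naming: the name depends on the line where the macro is expanded.
#define SKETCH_LINE_NAME() SKETCH_STR(SKETCH_CAT(test_, __LINE__))

// Two expansions in the same file get consecutive counter values,
// so their names differ within this translation unit.
std::string counter_name_a() { return SKETCH_COUNTER_NAME(); }
std::string counter_name_b() { return SKETCH_COUNTER_NAME(); }

// Returns the line-based name together with the line it was generated on.
std::pair<std::string, int> line_name() { return {SKETCH_LINE_NAME(), __LINE__}; }
```

Within one file the counter-based names are distinct, but a second .cpp file compiled separately restarts its counter, so its first generated name can coincide with the first name in this file. The line-based names only coincide if the two expansions happen to sit on the same line number in their respective files.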