Unit tests do not work on Intel GPUs #46
Comments
Unfortunately, switching to doctest (#47) does not seem to completely solve this. On b340cd1, the tests still abort prematurely with the same error code. However, they pass with … Playing around with the optimization flags set in amrex/Tools/GNUMake/comps/dpcpp.make:L35 gives the change from …
Running this through ze_tracer -c --demangle on b340cd1 gives the following Level Zero call that fails:
Using the … Enabling shader dumps shows that 10 different …
Since there is a new software stack on Dawn (…
With the latest software stack (oneAPI 2025.0) and the update to the drivers and firmware on Dawn, I think this issue is resolved for me now. I have the following modules loaded
Summary
The unit tests do not work on Intel GPUs. There seems to be some incompatibility between Catch2 and the Intel DPC++ compiler's SYCL implementation.
Steps to reproduce
Here are some steps to reproduce on Dawn at its state on 2024-02-26 (likely to change soon):
USE_SYCL=TRUE
Observed outcome
The tests abort with the following error:
Expected outcome
The tests should work without issues.
Additional information
I think there are several compounding issues here:
The Intel DPC++ compiler has trouble linking unnamed device kernels in Catch2 test cases. See intel/llvm#10659, "[SYCL] Name mangling of unnamed kernels not sufficiently robust to disambiguate them".
Even if I remove all but one of the test cases that use amrex::ParallelFor by modifying test_dirs in Tests/GNUMakefile (for example, keeping only the "CCZ4 RHS" test case, which contains multiple amrex::ParallelFor calls), I get an error. Interestingly, building with DEBUG=TRUE does allow this single test to pass.
Intel's Level Zero runtime uses SIGSEGV (i.e. a segfault) to trigger migration of memory between host and device when using USM shared allocations. We need to disable Catch2's POSIX signal handling to get around this (note that AMReX's own SIGSEGV handling is automatically disabled when running on Intel GPUs). It should be sufficient to simply define the macro CATCH_CONFIG_NO_POSIX_SIGNALS.
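To illustrate why a test framework's SIGSEGV handler can get in the way, here is a minimal POSIX sketch of fault-driven memory access. It is only an analogy and not Level Zero's actual implementation: the names touch_protected_page, runtime_fault_handler and g_faults_handled are hypothetical, and an mprotect'ed page stands in for a USM shared allocation. The handler makes the page accessible and returns, so the faulting store is retried and succeeds. If a framework such as Catch2 replaced this handler with its own, the same fault would be reported as a crash instead.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace {
void* g_page = nullptr;                     // page standing in for a shared allocation
std::size_t g_page_size = 0;
volatile sig_atomic_t g_faults_handled = 0; // number of faults the handler resolved

// Hypothetical stand-in for a runtime's fault handler: if the fault hit our
// page, make the page accessible and return so the access is retried.
// Note: mprotect is not formally async-signal-safe; this works on Linux but
// is a sketch, not production code.
void runtime_fault_handler(int, siginfo_t* info, void*) {
    const auto addr = reinterpret_cast<std::uintptr_t>(info->si_addr);
    const auto base = reinterpret_cast<std::uintptr_t>(g_page);
    if (addr >= base && addr < base + g_page_size) {
        mprotect(g_page, g_page_size, PROT_READ | PROT_WRITE);
        g_faults_handled = g_faults_handled + 1;
        return;
    }
    _exit(1); // unrelated fault: give up
}
} // namespace

// Writes to an initially inaccessible page and returns the value read back,
// or -1 if the page could not be mapped.
int touch_protected_page() {
    g_page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    g_page = mmap(nullptr, g_page_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (g_page == MAP_FAILED) return -1;

    struct sigaction sa {};
    sa.sa_sigaction = runtime_fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    struct sigaction old {};
    sigaction(SIGSEGV, &sa, &old);

    volatile int* p = static_cast<volatile int*>(g_page);
    *p = 42;              // first access faults; the handler unprotects the page
    const int value = *p; // page is now accessible, no further fault

    sigaction(SIGSEGV, &old, nullptr); // restore the previous handler
    munmap(g_page, g_page_size);
    return value;
}
```

In this sketch the store completes only because the "runtime" handler is the one installed for SIGSEGV, which is the situation that defining CATCH_CONFIG_NO_POSIX_SIGNALS is meant to preserve.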