Unit tests do not work on Intel GPUs #46

Closed
mirenradia opened this issue Feb 26, 2024 · 5 comments

@mirenradia
Member

mirenradia commented Feb 26, 2024

Summary

The unit tests do not work on Intel GPUs. There seems to be an incompatibility between Catch2 and the Intel DPC++ compiler's SYCL implementation.

Steps to reproduce

Here are some steps to reproduce on Dawn at its state on 2024-02-26 (likely to change soon):

  1. SSH to Dawn
  2. Clone AMReX:
    git clone https://github.com/AMReX-Codes/amrex.git
  3. Clone this repository:
    git clone https://github.com/GRTLCollaboration/GRTeclyn.git
  4. Start an interactive job (it is not currently possible to build on the login nodes).
  5. Load the required Intel modules:
    module load intel-oneapi-compilers/2024.0.0/gcc/znjudqsi intel-oneapi-mkl/2024.0.0/oneapi/4n7ruz44
  6. Change into the Tests directory:
    cd GRTeclyn/Tests
    
  7. Build with USE_SYCL=TRUE:
    make -j <num jobs>  USE_SYCL=TRUE
  8. Run the tests:
    ./Tests3d.sycl.ex

Observed outcome

The tests abort with the following error:

amrex::Abort::0::ParallelFor: Native API failed. Native API returns: -30 (PI_ERROR_INVALID_VALUE) -30 (PI_ERROR_INVALID_VALUE)!!!!! !!!
SIGABRT
See Backtrace.0 file for details

Expected outcome

The tests should work without issues.

Additional information

I think there are several compounding issues here:

  1. The Intel DPC++ compiler has trouble linking unnamed device kernels in Catch2 test cases. See intel/llvm#10659 ("[SYCL] Name mangling of unnamed kernels not sufficiently robust to disambiguate them").

    Even if I remove all but one of the test cases that use amrex::ParallelFor (keeping only the "CCZ4 RHS" test case) by changing test_dirs in Tests/GNUMakefile to just

    test_dirs = $(GRTECLYN_TESTS_HOME)/Common \
                $(GRTECLYN_TESTS_HOME)/CCZ4RHSTest

    (this test contains multiple amrex::ParallelFor calls), I still get the following error:

    amrex::Abort::0::ParallelFor: Native API failed. Native API returns: -46 (PI_ERROR_INVALID_KERNEL_NAME)!!!!! !!!
    SIGABRT
    See Backtrace.0 file for details
    

    Interestingly, building with DEBUG=TRUE does allow this single test to pass.

  2. Intel's Level Zero runtime uses SIGSEGV (i.e. a segfault) to trigger migration of memory between host and device when using USM shared allocations. To work around this, we need to disable Catch2's POSIX signal handling (note that AMReX automatically disables its own SIGSEGV handling when running on Intel GPUs - see here). Defining the macro CATCH_CONFIG_NO_POSIX_SIGNALS should be sufficient.
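To illustrate why a test framework's SIGSEGV handler can clash with a runtime that relies on faults, here is a toy Linux-only analogue. It is a sketch under stated assumptions, not Level Zero's actual mechanism: the names `run_fault_demo`, `fault_count` and `on_segv` are hypothetical, and `mprotect` stands in for whatever the driver really does. A page starts inaccessible, the first access faults, and the handler makes the page usable so that the access is retried.

```cpp
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstring>

namespace {
char* g_page = nullptr;            // page managed by the toy "runtime"
std::size_t g_page_size = 0;       // size of that page in bytes
volatile sig_atomic_t g_faults = 0;  // number of faults the handler serviced

// Toy fault handler (hypothetical): if the fault address lies inside the
// managed page, make the page accessible and return so that the faulting
// access is retried. Any other fault is treated as a genuine crash.
// Note: mprotect is not on POSIX's async-signal-safe list; this relies on
// Linux behaviour and is for illustration only.
void on_segv(int, siginfo_t* info, void*) {
    char* addr = static_cast<char*>(info->si_addr);
    if (g_page != nullptr && addr >= g_page && addr < g_page + g_page_size) {
        g_faults = g_faults + 1;
        if (mprotect(g_page, g_page_size, PROT_READ | PROT_WRITE) == 0) {
            return;
        }
    }
    _exit(1);
}
}  // namespace

// Maps one inaccessible page, installs the handler, touches the page and
// returns the value read back (42 on success, -1 on setup failure).
int run_fault_demo() {
    g_page_size = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* mem = mmap(nullptr, g_page_size, PROT_NONE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return -1;
    }
    g_page = static_cast<char*>(mem);

    struct sigaction action;
    struct sigaction previous;
    std::memset(&action, 0, sizeof action);
    action.sa_sigaction = on_segv;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    if (sigaction(SIGSEGV, &action, &previous) != 0) {
        munmap(mem, g_page_size);
        g_page = nullptr;
        return -1;
    }

    volatile int* value = reinterpret_cast<volatile int*>(g_page);
    *value = 42;         // faults once; the handler unprotects the page
    int result = *value; // no fault this time

    sigaction(SIGSEGV, &previous, nullptr);  // restore the old handler
    munmap(mem, g_page_size);
    g_page = nullptr;
    return result;
}

// Number of faults serviced so far by the toy handler.
int fault_count() { return g_faults; }
```

If another component replaced the SIGSEGV handler after this one was installed and treated every SIGSEGV as fatal, the first write would be reported as a crash instead of being serviced, which is the kind of conflict described above.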

@mirenradia mirenradia added the bug Something isn't working label Feb 26, 2024
@mirenradia mirenradia self-assigned this Feb 26, 2024
@mirenradia
Member Author

mirenradia commented Feb 28, 2024

Unfortunately switching to doctest (#47) does not seem to completely solve this. On b340cd1, the tests still prematurely abort with the same error code. However, they pass with DEBUG=TRUE.

Experimenting with the optimization flags set in amrex/Tools/GNUMake/comps/dpcpp.make:L35 identifies the change from -O1 to -O2 as the culprit (i.e. tests pass with -O1 and prematurely abort with -O2). I believe -O2 also implies -fvectorize -fslp-vectorize -fsycl-dead-args-optimization (based on this Stack Overflow answer), but the tests pass with -O1 -fvectorize -fslp-vectorize -fsycl-dead-args-optimization and fail with -O2, which suggests the cause is something else enabled by -O2.

@mirenradia
Member Author

Running this through ze_tracer -c --demangle on b340cd1 gives the following Level Zero call that fails:

>>>> [17719443688] zeKernelCreate: hModule = 0x7353da0 desc = 0x7ffee80ad960 {ZE_STRUCTURE_TYPE_KERNEL_DESC(0x1d) 0 0 "_ZTSZZN5amrex11ParallelForILi256EZ17run_ccz4_rhs_testvEUliiiE0_EEvRKNS_3Gpu10KernelInfoERKNS_3BoxEOT0_ENKUlRN4sycl3_V17handlerEE0_clESE_EUlNSC_7nd_itemILi1EEEE_" (amrex::ParallelFor<256, run_ccz4_rhs_test()::{lambda(int, int, int)#2}>(amrex::Gpu::KernelInfo const&, amrex::Box const&, run_ccz4_rhs_test()::{lambda(int, int, int)#2}&&)::{lambda(sycl::_V1::handler&)#2}::operator()(sycl::_V1::handler&) const::{lambda(sycl::_V1::nd_item<1>)#1})} phKernel = 0x7ffee80ad958 (hKernel = 0x1545e1c8f0d0)
<<<< [17719467271] zeKernelCreate [2270 ns] -> ZE_RESULT_ERROR_INVALID_KERNEL_NAME(0x2013265937)
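The kernel name in the trace above appears to be built from the mangled name of a closure type, with `{lambda(int, int, int)#2}` marking the second lambda defined in `run_ccz4_rhs_test()`. The host-only sketch below shows how an Itanium-ABI compiler (GCC or Clang) gives two lambdas in the same function distinct mangled names. It does not involve SYCL or DPC++, and the helper names `demangle` and `two_lambda_names` are hypothetical.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>
#include <type_traits>
#include <typeinfo>
#include <utility>

// Demangle an Itanium-ABI name; returns the input unchanged on failure.
std::string demangle(const char* mangled) {
    int status = 0;
    char* raw = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) {
        return mangled;
    }
    std::string result(raw);
    std::free(raw);  // __cxa_demangle allocates with malloc
    return result;
}

// Hypothetical stand-in for a test body that defines two kernels inline.
// Returns the implementation-defined type names of the two closure types.
std::pair<std::string, std::string> two_lambda_names() {
    auto first = [](int i, int j, int k) { return i + j + k; };
    auto second = [](int i, int j, int k) { return i * j * k; };

    // Every lambda expression has its own unique closure type, even when
    // the signatures are identical.
    static_assert(!std::is_same<decltype(first), decltype(second)>::value,
                  "closure types must be distinct");

    return {typeid(first).name(), typeid(second).name()};
}
```

Printing `demangle(names.first.c_str())` and `demangle(names.second.c_str())` for the returned pair shows the per-function numbering of the lambdas. The exact spelling depends on the demangler, so it is not reproduced here.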

@mirenradia
Member Author

Using the -save-temps option and inspecting the integration headers and preprocessed *.ii files shows no difference between -O1 and -O2 other than the short hash.

Enabling shader dumps shows that 10 different *.spv files (SPIR-V binaries) are created with -O1 but only 3 with -O2, so it seems that some device kernels are not being compiled in the latter case.

@mirenradia
Member Author

mirenradia commented Apr 17, 2024

Since there is a new software stack on Dawn (dawn-env/2024-04-15) with oneAPI 2024.1, I tried this again, but I still run into the same problem.

@mirenradia
Member Author

With the latest software stack (oneAPI 2025.0) and the update to the drivers and firmware on Dawn, I think this issue is now resolved for me. I have the following modules loaded:

 1) dot                            9) gcc-runtime/14.2.0/gcc/w62p4k2j  17) intel-oneapi-compilers/2025.0.3/gcc/sb5vj5us   
 2) dawn-env/2023-12-22(default)  10) zlib-ng/2.2.1/gcc/driavgwo       18) intel-oneapi-tbb/2022.0.0/oneapi/ocxz7f5f      
 3) dawn-env/2024-04-15           11) zstd/1.5.6/gcc/wiebw4lq          19) intel-oneapi-mkl/2025.0.1/oneapi/zkanytru      
 4) dawn-env/2024-12-01           12) binutils/2.43.1/gcc/urc5jbqg     20) intel-oneapi-mpi/2021.14.1/oneapi/mdnh2zdp     
 5) rhel8/global                  13) gcc-runtime/13.2.0/gcc/ayevhr77  21) intel-oneapi-dpl/2022.7.1/oneapi/jfrogdbt      
 6) rhel8/slurm                   14) zlib-ng/2.1.6/gcc/thn3ikgx       22) intel-oneapi-inspector/2024.1.0/oneapi/2xp3byzi
 7) default-dawn                  15) zstd/1.5.5/gcc/7o3rooli
 8) glibc/2.28/gcc/olqvxojx       16) binutils/2.42/gcc/s65uixqt
