BinaryBH example: Outputs of GPU runs differ from those of CPU runs #77

Closed
julianakwan opened this issue Nov 30, 2024 · 4 comments · Fixed by #79
Labels: bug (Something isn't working)

Comments

@julianakwan (Contributor)

There is some discrepancy between the outputs of the BinaryBH example run using the parameter file params_test.txt with CUDA, SYCL and CPU builds. Some of the differences are quite large, and since params_test.txt is the basis of our regression test, I've commented out the final line in .gitlab-ci.yml that compares the plotfiles from the A100 build to those in the .github/workflows/data directory.

Differences observed on Wilkes (A100 build)
Here are the differences between the GRTeclyn outputs built using CUDA on Wilkes3 and .github/workflows/data/plt00008_compare, using params_test.txt:

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 chi                                  0.02993966075             0.03992741451
 h11                                 0.004593910033            0.004540568661
 h12                                 0.005260395774              0.2539277143
 h13                                 0.002278114277              0.3814838726
 h22                                 0.007705697911            0.007456249062
 h23                                 0.003196238505              0.1589191275
 h33                                 0.003986811662            0.003941745199
 K                                      0.120704052               1.000055711
 A11                                  0.01032765309               0.147299736
 A12                                 0.007689020127             0.09536001707
 A13                                 0.003482179618              0.0766795234
 A22                                  0.01076477587              0.0844108539
 A23                                 0.006418734019             0.08121797769
 A33                                 0.005525804435             0.08225255259
 Theta                                0.07312078004              0.6403893793
 Gamma1                              0.004074200533             0.09193938941
 Gamma2                               0.01386288378              0.1008186981
 Gamma3                               0.01831955093               0.441101293
 lapse                                0.02095200077             0.02421934145
 shift1                             0.0004815617707             0.06468082353
 shift2                               0.00181913122             0.07245613423
 shift3                              0.003163369487              0.4454353792
 B1                                  0.003555939314              0.1034092653
 B2                                   0.01143737548              0.1099456271
 B3                                   0.01410172495              0.4398213061
 Ham                                    1.784852899              0.6181094845
 Mom1                                  0.1686440519              0.9854325417
 Mom2                                   0.177850683              0.9376108368
 Mom3                                  0.1654757596               1.000914619
 Weyl4_Re                              0.1009987194              0.2853958764
 Weyl4_Im                              0.1045289952              0.2702705423

How to reproduce

In an interactive job on a Wilkes node, load these modules (these are the same as for the current version of .gitlab-ci.yml):

Currently Loaded Modulefiles:
 1) rhel8/slurm                               17) gcc/9.4.0/gcc-11.2.0-72sgv5z  
 2) rhel8/global                              
 3) libpciaccess/0.16/gcc-9.4.0-6fonbj6       
 4) libiconv/1.16/gcc-9.4.0-ahebbov           
 5) libxml2/2.9.12/gcc-9.4.0-gnknt5e          
 6) ncurses/6.2/gcc-9.4.0-aiirok7             
 7) hwloc/2.5.0/gcc-9.4.0-7sqomga             
 8) libevent/2.1.12/gcc-9.4.0-hgny7cm         
 9) numactl/2.0.14/gcc-9.4.0-52dwc6n          
10) cuda/11.4.0/gcc-9.4.0-3hnxhjt             
11) gdrcopy/2.2/gcc-9.4.0-e4igtfp             
12) knem/1.1.4/gcc-9.4.0-bpbxgva              
13) libnl/3.3.0/gcc-9.4.0-whwhrwb             
14) rdma-core/34.0/gcc-9.4.0-5eo5n2u          
15) ucx/1.11.1/gcc-9.4.0-lktqyl4              
16) openmpi/4.1.1/gcc-9.4.0-epagguv(default)  

Using a fresh pull from the develop branch of GRTeclyn, navigate to Examples/BinaryBH and create the executable:

make -j 8 USE_CUDA=TRUE CUDA_ARCH=80 COMP=gnu DEBUG=TRUE TEST=TRUE USE_ASSERTION=TRUE

Then run the executable:

mpirun  -np 1 ./main3d.gnu.DEBUG.MPI.CUDA.ex ./params_test.txt

This will give you a plotfile called plt00008. You can then compare this with the one we currently use for regression testing (but for CPU builds only) using fcompare:

./fcompare.intel-llvm.ex ~/GRTeclyn/Examples/BinaryBH/plt00008 ~/GRTeclyn/.github/workflows/data/plt00008_compare/

(NB: at this stage, I was no longer on the compute node but back on a login node, so the Intel build of fcompare is appropriate.)

Differences observed on Dawn (SYCL build)

Here are the differences between the GRTeclyn outputs built using SYCL on Dawn and .github/workflows/data/plt00008_compare, using params_test.txt:

           variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 chi                                 0.002522023382            0.003363307219
 h11                                 0.000616133202           0.0006089790897
 h12                                 0.001469214953             0.07095389732
 h13                                0.0002266257026             0.03794503358
 h22                                 0.001334773991            0.001291563366
 h23                                0.0008048223489             0.04001572589
 h33                                0.0008191868137           0.0008099259595
 K                                    0.01856748466              0.1559801292
 A11                                 0.002165240508             0.03088353921
 A12                                 0.001124639468             0.01394802776
 A13                                0.0004207058565            0.009264173376
 A22                                 0.001589257866             0.01246210072
 A23                                0.0005927683912            0.007500380873
 A33                                 0.001109900984             0.01652120292
 Theta                                0.01340625562              0.1174081813
 Gamma1                              0.001634615371             0.03689101222
 Gamma2                              0.001988390977             0.01446064657
 Gamma3                              0.000890861672             0.02145051631
 lapse                               0.003555215522            0.004109598022
 shift1                             0.0002160420102             0.02902772602
 shift2                             0.0002296623732            0.009147379589
 shift3                             5.327206302e-05            0.007501334079
 B1                                  0.001346558416             0.03916026687
 B2                                  0.001694649995             0.01629036206
 B3                                 0.0008750047986             0.02729093138
 Ham                                   0.1251418186             0.04333795983
 Mom1                                 0.02422399215              0.1564597094
 Mom2                                 0.03252644436              0.1714879084
 Mom3                                 0.01697456918              0.1092753333
 Weyl4_Re                             0.01096920156             0.03099600228
 Weyl4_Im                             0.00388211932             0.01003759544

How to reproduce

Submit an interactive job on Dawn, then load these modules on a compute node:

module list
Currently Loaded Modulefiles:
 1) dawn-env/2023-12-22(default)                  
 2) dawn-env/2024-04-15                           
 3) rhel8/global                                  
 4) rhel8/slurm                                   
 5) default-dawn                                  
 6) gcc-runtime/13.2.0/gcc/ayevhr77               
 7) zlib-ng/2.1.6/gcc/thn3ikgx                    
 8) zstd/1.5.5/gcc/7o3rooli                       
 9) binutils/2.42/gcc/s65uixqt                    
10) intel-oneapi-compilers/2024.1.0/gcc/wadpqv2p  
11) intel-oneapi-mpi/2021.12.0/oneapi/nbxgtwyb    
12) intel-oneapi-tbb/2021.12.0/oneapi/pvsbvzxn    
13) intel-oneapi-mkl/2024.1.0/oneapi/xps7uyz6   

Then in the Examples/BinaryBH directory, make the binary black hole example:

make -j 24 USE_SYCL=TRUE SYCL_AOT=TRUE SYCL_PARALLEL_LINK_JOBS=24 AMREX_INTEL_ARCH=pvc

Finally, to run the example:

mpiexec -bootstrap fork -np 1 ./main3d.sycl.MPI.ex params_test.txt

Again, fcompare can be run on the outputs to produce the above result.

@julianakwan added the bug label Nov 30, 2024
@mirenradia (Member)

Good spot, @julianakwan (although I do feel a bit stupid for not having noticed this before). I can reproduce this with a CUDA build on the A100s.

I believe the issue comes from the MultiFab ParallelFors being non-blocking, so it is necessary to call amrex::Gpu::streamSynchronize() after them. I tried adding these calls in a few places where I thought they might be needed in 96a55d4, and this seems to resolve the problem with CUDA on the A100s. Here are the differences I get after this change:

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 chi                                4.440892099e-16           5.922356338e-16
 h11                                4.440892099e-16           4.389327527e-16
 h12                                9.454242944e-17           4.563714222e-15
 h13                                9.221139477e-17           1.544135003e-14
 h22                                4.440892099e-16           4.297131544e-16
 h23                                1.010476425e-16            5.02415673e-15
 h33                                4.440892099e-16           4.390692822e-16
 K                                  6.938893904e-15           5.829137141e-14
 A11                                1.081816928e-15           1.542957985e-14
 A12                                3.603888021e-16           4.469578924e-15
 A13                                4.000706016e-16           8.809776178e-15
 A22                                1.194790794e-15           9.368825974e-15
 A23                                3.346932106e-16           4.234963723e-15
 A33                                1.154892154e-15           1.719076909e-14
 Theta                              3.055390142e-15            2.67590061e-14
 Gamma1                             8.868367195e-16           2.001257077e-14
 Gamma2                             8.737585308e-16            6.35446403e-15
 Gamma3                             9.772727332e-16           2.353094051e-14
 lapse                              4.440892099e-16           5.133422971e-16
 shift1                             7.532494593e-17           1.011724732e-14
 shift2                             9.589768216e-17           3.819611927e-15
 shift3                             9.542334371e-17           1.343660092e-14
 B1                                 8.499738456e-16           2.471784897e-14
 B2                                 7.774271678e-16           7.473280576e-15
 B3                                  9.52200558e-16           2.969835893e-14
 Ham                                1.033895192e-13           3.580465506e-14
 Mom1                               1.449187992e-14           9.360038875e-14
 Mom2                               1.415534356e-14           7.462554224e-14
 Mom3                                1.29236899e-14           8.319493959e-14
 Weyl4_Re                            7.24160315e-15            2.04628702e-14
 Weyl4_Im                           6.839364144e-15            1.76838843e-14

I haven't had time to work out whether all of them are necessary. Given that synchronization is expensive, we should only add these calls where they are necessary.
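
To illustrate the behaviour described above, here is a minimal, hypothetical sketch (not GRTeclyn code; the function name fill_example and the value being written are made up for illustration) of an asynchronous MultiFab ParallelFor followed by an explicit amrex::Gpu::streamSynchronize():

#include <AMReX_MultiFab.H>
#include <AMReX_MFIter.H>
#include <AMReX_Gpu.H>

// Hypothetical example: fill a MultiFab on the GPU and make sure the kernels
// have completed before returning.
void fill_example(amrex::MultiFab& mf)
{
    for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
        const amrex::Box& bx = mfi.validbox();
        const amrex::Array4<amrex::Real>& a = mf.array(mfi);

        // On GPU builds this ParallelFor only enqueues the kernel on the
        // device stream and returns immediately (it is non-blocking).
        amrex::ParallelFor(bx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
            {
                a(i, j, k) = 1.0; // placeholder computation
            });
    }

    // Without this, subsequent host-side reads of mf (or later operations that
    // assume the kernels have finished) may observe stale or partially written
    // data, which would be consistent with the discrepancies reported above.
    amrex::Gpu::streamSynchronize();
}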

@julianakwan (Contributor, Author)

Thanks a lot for fixing this, @mirenradia! Only one of the extra Gpu::streamSynchronize calls turned out to be necessary. I also managed to remove one from GRAMRLevel because Gpu::streamSynchronize is already called immediately beforehand in specificEvalRHS.
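
As a rough sketch of why that call was redundant (a hypothetical illustration only; the function names below are placeholders, not the actual specificEvalRHS or GRAMRLevel code): if the callee already ends with a stream synchronization, the caller does not need to synchronize again immediately afterwards.

#include <AMReX_Gpu.H>

// Hypothetical callee, standing in for a routine like specificEvalRHS: its
// GPU kernels are guaranteed to have completed by the time it returns.
void evaluate_rhs_example()
{
    // ... ParallelFor kernel launches filling the RHS ...
    amrex::Gpu::streamSynchronize(); // all kernels launched above have finished here
}

// Hypothetical caller, standing in for the GRAMRLevel code path.
void advance_example()
{
    evaluate_rhs_example();
    // amrex::Gpu::streamSynchronize();  // redundant: the callee already synchronized
    // ... host-side work that depends on the RHS can safely run here ...
}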

Here are the results of fcompare if you build and run the version of the BinaryBH example in the test/binarybh_add_stream_synchronize branch:

            variable name            absolute error            relative error
                                        (||A - B||)         (||A - B||/||A||)
 ----------------------------------------------------------------------------
 level = 0
 chi                                4.440892099e-16           5.922356338e-16
 h11                                4.440892099e-16           4.389327527e-16
 h12                                9.110008754e-17           4.397546875e-15
 h13                                8.229094489e-17           1.378011131e-14
 h22                                4.440892099e-16           4.297131544e-16
 h23                                9.020562075e-17           4.485084119e-15
 h33                                4.440892099e-16           4.390692822e-16
 K                                   5.14518983e-15            4.32230519e-14
 A11                                1.076829598e-15           1.535844729e-14
 A12                                 2.94083063e-16            3.64724834e-15
 A13                                3.123315408e-16           6.877713476e-15
 A22                                9.868408174e-16           7.738208168e-15
 A23                                3.165870344e-16           4.005861377e-15
 A33                                9.079108992e-16           1.351441047e-14
 Theta                              2.444252483e-15           2.140668263e-14
 Gamma1                             6.557254739e-16            1.47972588e-14
 Gamma2                             7.394258816e-16           5.377521365e-15
 Gamma3                               7.9722741e-16           1.919577833e-14
 lapse                              4.440892099e-16           5.133422971e-16
 shift1                             6.861644499e-17           9.216197049e-15
 shift2                             6.434739894e-17           2.562961762e-15
 shift3                             5.921776741e-17           8.338478587e-15
 B1                                 6.221152066e-16           1.809155634e-14
 B2                                 7.124834576e-16           6.848987282e-15
 B3                                 7.590363884e-16             2.3673726e-14
 Ham                                1.060471155e-13            3.67250029e-14
 Mom1                               1.242755898e-14           8.026731925e-14
 Mom2                               1.100562783e-14           5.802055888e-14
 Mom3                               1.112911846e-14           7.164256845e-14
 Weyl4_Re                           7.087212761e-15           2.002660347e-14
 Weyl4_Im                           6.474638534e-15           1.674084846e-14

I am getting some errors in the SYCL build of our GPU workflow, so I will try to fix those first before starting a PR.

@julianakwan (Contributor, Author)

Our SYCL build is failing when using the latest version of the oneAPI compiler (oneAPI DPC++/C++ Compiler 2025.0.1), so I rebased this branch onto develop, in which we had frozen the oneAPI compiler to the 2024.2 version.

I will open a separate issue for fixing the SYCL build with the 2025.0 compiler.

@mirenradia (Member)

I don't think there is a problem with oneAPI 2025.0. At least it worked for me on Dawn (see #46 (comment)).

@mirenradia linked a pull request Dec 10, 2024 that will close this issue
@mirenradia linked a pull request Dec 23, 2024 that will close this issue