Scaling for unstructured grids using SCOTCH for domain decomposition #879
Comments
Hi Matthew,
so it seems that you are running out of memory, as you anticipated. The memcheck option should show the issues; moreover, debug compile flags should be used to check what is happening. The point is that you can inquire about the memory usage with the sysadmins, since it is known how much memory this job has taken. I just ran into the same issue on DATARMOR and the error was also the same; the reason was not enough memory.
Now, normally memory usage should go down with the number of cores; however, it seems that with SCOTCH there is some additional memory usage. I will help soon on the issue, but first need to get a significant number of pull requests in line.
Cheers, and many thanks for this precise and nice insight into this problem.
Aron
From: Matthew Masarik ***@***.***>
Sent: Thursday, 2 February 2023 23:48
To: NOAA-EMC/WW3 ***@***.***>
Cc: Aron Roland ***@***.***>; Mention ***@***.***>
Subject: [NOAA-EMC/WW3] Scaling for unstructured grids using SCOTCH for domain decomposition (Issue #879)
Describe the bug
Running WW3 with unstructured grids, using the SCOTCH <https://gitlab.inria.fr/scotch/scotch> mesh/hypergraph partitioning library for MPI domain decomposition, scales to ~2K cores, grid size dependent. Above this core count WW3 will fail during model initialization.
This behavior was found during scaling simulations in which the allowable resources are ~8K cores. Experiments for two separate meshes, unst1 = ~0.5M nodes and unst2 = ~1.8M nodes, were conducted on hera. I was unable to run the same experiments on another HPC machine (there are ongoing issues with building WW3/SCOTCH on orion, and SCOTCH is currently not available on WCOSS2, which are the machines I have access to).
Note: ParMetis <https://github.com/KarypisLab/ParMETIS>, which is the partitioning library SCOTCH is replacing, was able to scale out to ~8K cores for each of the grids.
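For context, below is a minimal sketch of the kind of distributed partitioning call that SCOTCH (via its PT-SCOTCH library) provides in place of ParMetis. This is not WW3's actual call path; the function names follow the PT-SCOTCH user manual, the variable names are illustrative, and the distributed CSR arrays are assumed to be already built on each MPI rank:

```c
#include <stdio.h>
#include <mpi.h>
#include <ptscotch.h>   /* required headers/order may differ by SCOTCH version */

/* Partition the locally held part of a distributed graph into partnbr parts. */
int partition_local_graph (MPI_Comm comm,
                           SCOTCH_Num  vertlocnbr,   /* number of local vertices            */
                           SCOTCH_Num *vertloctab,   /* local CSR index, size vertlocnbr+1  */
                           SCOTCH_Num  edgelocnbr,   /* number of local arcs                */
                           SCOTCH_Num *edgeloctab,   /* global neighbor ids, size edgelocnbr */
                           SCOTCH_Num  partnbr,
                           SCOTCH_Num *partloctab)   /* output: part index per local vertex */
{
  SCOTCH_Dgraph grafdat;
  SCOTCH_Strat  stradat;

  if (SCOTCH_dgraphInit  (&grafdat, comm) != 0) return 1;
  if (SCOTCH_dgraphBuild (&grafdat, 0,               /* baseval = 0 (C numbering)   */
                          vertlocnbr, vertlocnbr,    /* vertlocnbr, vertlocmax      */
                          vertloctab, NULL,          /* compact CSR, no vendloctab  */
                          NULL, NULL,                /* no vertex weights or labels */
                          edgelocnbr, edgelocnbr,
                          edgeloctab, NULL, NULL) != 0) return 1;
  SCOTCH_stratInit (&stradat);                       /* default partitioning strategy */
  if (SCOTCH_dgraphPart (&grafdat, partnbr, &stradat, partloctab) != 0) return 1;
  SCOTCH_stratExit (&stradat);
  SCOTCH_dgraphExit (&grafdat);
  return 0;
}
```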
To Reproduce
1. Build SCOTCH
2. Build WW3 with SCOTCH
3. Run the executable with core counts (= MPI tasks) >~2K
Expected behavior
WW3 will error and core dump.
* SCOTCH build instructions for (Intel) hera
# https://gitlab.inria.fr/scotch/scotch.git
cd scotch
module purge
module load cmake/3.20.1
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load hdf5/1.10.6
module load netcdf/4.7.4
module load gnu/9.2.0
mkdir build && cd build
cmake -DCMAKE_Fortran_COMPILER=ifort \
-DCMAKE_C_COMPILER=icc \
-DCMAKE_INSTALL_PREFIX=<path-to>/install \
-DCMAKE_BUILD_TYPE=Release .. |& tee cmake.out
make VERBOSE=1 |& tee make.out
make install
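For step 2 of the reproduction (building WW3 against this SCOTCH install), a rough sketch is given below. The switch file name is a placeholder, and the assumption that the WW3 CMake build picks up the SCOTCH location from the SCOTCH_PATH environment variable is based on the job card further down:

```bash
export SCOTCH_PATH=<path-to>/install     # same prefix as the SCOTCH install above
cd WW3 && mkdir build && cd build
cmake .. -DSWITCH=<switch-file-with-PDLIB/SCOTCH> |& tee cmake.out
make -j8 |& tee make.out
```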
Screenshots
* hera environment used (job card)
#SBATCH -q batch
#SBATCH -t 08:00:00
#SBATCH --cpus-per-task=1
#SBATCH -n 2400
#SBATCH --exclusive
module purge
module load cmake/3.20.1
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load jasper/2.0.25
module load zlib/1.2.11
module load libpng/1.6.37
module load hdf5/1.10.6
module load netcdf/4.7.4
module load bacio/2.4.1
module load g2/3.4.5
module load w3emc/2.9.2
module load esmf/8.3.0b09
module load gnu/9.2.0
export SCOTCH_PATH=/scratch1/NCEPDEV/climate/Matthew.Masarik/waves/opt/hpc-stack/scotch/install
ulimit -s unlimited
ulimit -c 0
export KMP_STACKSIZE=2G
export FI_OFI_RXM_BUFFER_SIZE=128000
export FI_OFI_RXM_RX_SIZE=64000
export OMP_NUM_THREADS=1
* log output / error message
<https://user-images.githubusercontent.com/86749872/216459978-52d023fe-6bee-45ab-ba39-04c73af6aee6.png>
* Results from the two grids mentioned above. Both were run separately using SCOTCH, and ParMetis for decomposition.
* unst1
* SCOTCH: scaled to ~1800 cores.
* ParMetis: scaled through the allowable range, 8K cores.
* unst2
* SCOTCH: scaled to ~2200 cores.
* ParMetis: scaled through allowable range, 8K cores.
The plot below shows this behavior.
<https://user-images.githubusercontent.com/86749872/216462966-e6722057-6db1-4eba-89e7-5d29d78e399f.png>
Additional context
This stems from current PR #849 <#849> .
This issue is intended to be a place we can all collect information. @aliabdolali <https://github.com/aliabdolali> @aronroland <https://github.com/aronroland> please share any information you've learned working on this topic.
TODO
* For the unstructured meshes, OMP threads cannot be used. However, to test whether this is memory related, I will be re-running the experiments and adding more than one cpu-per-task, with one thread, to provide more memory per task.
* Another possible detail: the serial versions of the Intel compilers (ifort/icc) are passed to cmake. I believe from the variable names this is correct, though I'm not 100% certain that they shouldn't be the MPI wrapper names, mpiifort/mpiicc.
|
Hi Aron, That is very helpful information. Thank you for sharing your experience. I have those re-runs in the works, so I'll report back when I have the results. That would be amazing news if we can fix this just in the job card. Cheers and thanks again for the insight. |
Hi Matthew,
I have to thank you, you did a nice job on that. I also discussed with Ali, and I think the first step should be to check with your admins and run memcheck on that. I am sorry that I am tied up with the other stuff; I would like to jump in, but actually I find that I totally rely on you guys, since I have 450 CPUs max. and not more!
Cheers
Aron
|
@aronroland we understand if you have other priorities, but this is a high priority item for us and we'll continue to work on this. It'd be great if you could include us (@MatthewMasarik-NOAA and myself) on your conversations with the SCOTCH developers on this issue. |
Hi Aron, I was able to get some runs in yesterday and will be able to share the results here this afternoon. Please stay tuned. Thanks |
Hi Jessica,
Ok, this is no problem. As for the schedule, this is also super important for me and my work. The only problem is that I cannot run this myself on such a large core count. Otherwise, with respect to SCOTCH, I will write some mail and introduce you to the SCOTCH team. They maintain SCOTCH in quite a similar way as we do, but not on GitHub, therefore we need to get access for you there. Once we are as sure as possible about the nature of the problem, we can engage them via their ticketing system. So let me do that for Matthew and you.
Cheers
Aron
|
Following @aronroland's initial comments above I performed some more scaling runs where I added cores to tasks to increase the memory (though the OMP thread count stays at 1 for unstructured grids, so cores != tasks x threads for the new cases). The new cases run are cores-per-task=<2,4>; these are the only cases that made sense to me -- cores-per-task=1 is what has already been run, and cores-per-task=6 (or above) would eat up too many cores for memory alone, leaving the corresponding task count too low for performance.
I attempted to run total core counts between 1K -- 8K. All 4-core runs completed. In the 2-core runs the model crashed after ~4K, in the same manner as before. The table gives the parameter details for the highest scaling/best performance for each of the cores-per-task cases.

| cores-per-task | max tot cores | mpi tasks | min runtime/sim day |
|---|---|---|---|
| 1 | 2200 | 2200 | 992 sec, ~17 min |
| 2 | 4000 | 2000 | 933 sec, ~16 min |
| 4 | 8000 | 2000 | 903 sec, ~15 min |

<https://user-images.githubusercontent.com/86749872/216684278-1cc270ef-a272-416a-959f-486dd1c16fdc.png>
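For reference, a minimal sketch of how the cores-per-task=4 case might be requested with SLURM on hera; the exact flag combination and executable name are illustrative assumptions, since the job cards used for these re-runs are not shown in the thread:

```bash
#SBATCH -q batch
#SBATCH -t 08:00:00
#SBATCH -n 2000               # MPI tasks
#SBATCH --cpus-per-task=4     # reserve 4 cores per task -> ~4x memory per rank
#SBATCH --exclusive
export OMP_NUM_THREADS=1      # unstructured grids: thread count stays at 1
srun -n 2000 ./ww3_shel       # total cores = 2000 tasks x 4 cores = 8000
```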
|
Hi Matthew,
hmm, so memory seems not to be the issue, otherwise we should pass more active cores using the 4 cores per task, right? Let's see what the debug flags will show. But it was a really good try. In the best case the SCOTCH build should also be compiled in debug mode.
Cheers
Aron
|
Hi @aronroland, quick clarification: I only performed runs with total core requests up to 8K because this is our rough upper limit of projected resources for GFSv17. Runs past 8K for the cores-per-task=4 case are unknown. |
@MatthewMasarik-NOAA can you go past the 8k cores with cores-per-task=4 and see how far you can push this? |
@arunchawla-NOAA yes, I would like to be able to try this. The max core request on hera is 8,400 cores, so to push out past 8K cores with cores-per-task=4, I think I would need to submit a request to do a run in hera's novel queue. |
Thanks Matt. A few things:
|
@arunchawla-NOAA @MatthewMasarik-NOAA
Here are my thoughts:
* We need to compile SCOTCH with debug flags, as consistent as possible with the Debug flags in WW3, and try it. We can then report to the SCOTCH developers the outcomes of our simulation with debug flags.
* I am not sure about the practical difference between ntasks-per-node and cores-per-task, but based on my experience after years of using unstructured WW3 at various scales and setup sizes, ntasks-per-node=20 or 30 out of 40 cores/node can help to have more memory per core if memory is an issue (see the sketch below). However, I do not think our problem is a memory issue.
* Testing on a different platform like acorn would be beneficial.
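To illustrate the ntasks-per-node idea above, a minimal sketch of undersubscribing hera's 40-core nodes so each MPI rank gets roughly twice the memory (flag values are illustrative, not the settings used in any run reported here):

```bash
#SBATCH --nodes=100
#SBATCH --ntasks-per-node=20   # leave 20 of 40 cores idle -> ~2x memory per rank
export OMP_NUM_THREADS=1
srun ./ww3_shel                # 100 nodes x 20 tasks = 2000 MPI tasks
```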
|
@aliabdolali you mentioned 2-3 weeks ago you'd be looking into checking the decomposition and running with more diagnostic output. Any update on that? @MatthewMasarik-NOAA I'm still not sure what running with the novel queue will tell us that we can't find out other ways, but there's no harm in making that request and doing that run. For the runs that are completing, it'd be interesting to see the memory usage (and perhaps compare it with ParMetis memory usage) since memory usage still seems to be a theory. I think I lost this, but what happens if we run with 2 threads and more than 2,000 MPI tasks? Or does that also need the novel queue? |
Hi All,
I have some answers from Francois about these things, but we need to have a clear track on the actual status. Since I cannot run it on my own, I would be happy if somebody would do the following:
1. Run with full debug flags
2. Run with simplified debug flags
3. Run 1. and 2. with the "memcheck" option (see the sketch below)
Please provide all error messages that you have, and let's do another iteration to get a decent description of the problem for our colleagues. As for the memory issue, I am totally with you, Matthew, but either I am missing something or I would conclude that you did all the needed steps and we are left with a max core count of about 2K with SCOTCH.
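One possible reading of the "memcheck" option above is Valgrind's memcheck tool run under the MPI launcher; a minimal sketch, assuming Valgrind is installed on the system (whether this is the tool Aron means is an assumption):

```bash
# run each MPI rank under valgrind/memcheck; expect a large slowdown
srun -n 2200 valgrind --tool=memcheck --track-origins=yes \
     --log-file=memcheck.%q{SLURM_PROCID}.log ./ww3_shel
```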
Cheers
Aron
|
Hi Arun,
I am in close contact with Francois, and I am totally with you ... we need to make sure that it is not on our end. Once we have summed up all the information I will include everybody in the conversation with the SCOTCH team so we are all on the same line. Otherwise we should as well have this issue posted on the GitLab side of INRIA to keep a sustainable development; I will do that part.
All this was already agreed with Francois, so we are ready to go once we have all the debug output etc. ... thanks everybody for pushing on that!
Cheers
Aron
|
@aronroland thanks for pushing it further to SCOTCH developers. |
@arunchawla-NOAA @JessicaMeixner-NOAA @aliabdolali @aronroland, Running in debug mode on orion is underway. Also, getting SCOTCH built on acorn and running a canned case there is highest priority. Regarding the MPI tasks, I want to clarify that I've just referred to 2K as roughly where runs start to die. In the case of 4 cores-per-task, I initially ran 8K cores with 2K tasks, and that was successful. I found hera's limit of 8400 cores when I wanted to see how far it could be pushed. I did one 4-core run at the 8400 limit (2100 tasks), and that was successful. Since other runs had made it to 2200 MPI tasks before dying, I was not surprised, so that's what I was referring to previously.
In the 2-core case, the runs crash. The next increment of resources I had tried with 2 cores was 2,200 MPI tasks. This count (and several other increments between 2200-4000 MPI tasks) all crashed. |
@MatthewMasarik-NOAA, @JessicaMeixner-NOAA, is there any news on the run using the debug flags? |
@aronroland I have been digging into it. I can give an update early afternoon. |
@aronroland - no news from me, working on debugging building scotch on orion which is the blocking issue for the PR. Any news from you? |
@JessicaMeixner-NOAA as I said, I cannot run on that many cores since I do not have access to more than 448 cores. I am waiting for the debug part so we can see the nature of the problem. Anything I missed that I should do? By the way, before I forget: it would be great if SCOTCH and WW3 could be built with the same debug flags, and if we could have a SCOTCH build for debugging. This would be really helpful. Thanks for your hard work on this issue. |
@aronroland do you think it'd be potentially interesting/useful to compare the memory usage between SCOTCH and ParMetis even on smaller node counts, to see if it's vastly different even for smaller numbers of cores? Still haven't heard anything from @aliabdolali, who was going to look into the decomposition and run with extra output, which could give us more information as well. |
We need SCOTCH outputs with debug, so we can ask SCOTCH developers to take a look. I think Aron and I asked for it days ago. I'd appreciate it if you do it at your earliest convenience, then we can continue. |
This would clearly be the next step, but, honestly, before we have the debug/traceback it is fishing in the dark. Otherwise, I had some memory issues with OASIS, and finally the sysadmin from DATARMOR was kind enough to give hints on the memory usage; could you check with your sysadmin whether he can tell you something about that when looking at the job ID? While you get the debug output, I will go on with the memory examination ... |
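On SLURM systems such as hera, the per-job memory high-water mark can usually be queried directly from job accounting without involving the sysadmin; a minimal sketch (the field selection is illustrative):

```bash
# peak resident and virtual memory per job step for a completed job
sacct -j <jobid> --format=JobID,JobName,NTasks,MaxRSS,MaxVMSize,State
```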
@JessicaMeixner-NOAA I was thinking more about the memory issue: we basically cannot see this without modifying the SCOTCH code. Without deep insight into the memory management of SCOTCH, it will not be helpful to look at WW3, since WW3's memory usage does not depend on SCOTCH or ParMetis. |
Hi @aronroland, as far as a list of the flags, they are just as they appear in those two lines from the CMakeLists.txt file. It may be that we need to look into the flags we are using, though for now, so we don't get sidetracked: I understood from Ali (meeting last Thu) that the run done at ERDC out to 8K cores used the standard WW3 cmake compile, so the flags listed. |
@MatthewMasarik-NOAA about the flags used at ERDC you should discuss with @thesser1. |
@aronroland, from my conversation with @aliabdolali last Thu he stated that for the particular run in question at ERDC, the standard flags had been used. Do you believe different flags were used? @aliabdolali @thesser1 can either of you confirm if the standard WW3 compile options were used for the 8K core run? |
From what I recall, Ty compiled SCOTCH the same way I did initially, and tested WW3 with its release flags. But I'll leave it to him to confirm. I usually use Aron's flags during development and debugging, as the WW3 standard flags (including debug) usually do not provide insightful info. |
I have run a scotch test with release flags as well as the debug and reldebug flags that Aron described. All are working on my machine. If you want the full output from the debug flags, I can provide it.
Ty
|
This just slipped. Let me be frank, Matt: you must know exactly which flags are used and why they are used. I think that some of the flags that are used, like the i4 / 32-bit stuff, are a very bad choice. They were most likely used to make the model b4b; we will now check if it is b4b using other flags. |
@aronroland, I wholeheartedly agree with your sentiment that we should know what flags we are using and why. I've tried to do some tests using the Intel debug flags you gave in the post, though I'm running into problems with the compile. Here's what I tried. First I tried using just those flags and no others, by removing the standard + debug flags and replacing them with yours. The compile failed. Next I tried putting back in the standard Intel flags and replacing only the debug flags with yours. This compile also failed. The Intel compiler doesn't seem to like / recognize some of the flags, so I am going to try removing those until the compile succeeds. I'll keep you posted. Have you and Ali been successful compiling with those compile flags on any NOAA machines? |
Matt, "the compile failed" -- can you please be more specific? You can just paste everything here that you got. I always use these flags because I know why I am using them and what I am using them for. Those flags have not been developed by me; they are developed by Intel with a clear purpose (see the Intel Fortran compiler manual). This I know precisely, and therefore I am able to interpret it decently. I would very much appreciate it if you share any kind of compiler problems/bugs in stdout, and my warm suggestion is not to go forward as long as we have compilation issues. I wish you would be willing to share this stdout with us; let me thank you in advance for your precious work. Btw, a NOAA machine is nothing special, it uses Intel CPUs and Mellanox or other IB network, so there is no magic there with "NOAA" being particular with respect to the hardware infrastructure. |
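For readers following along, a typical set of Intel Fortran debug options of the kind being discussed is shown below; Aron's exact flag list is not reproduced in this thread, so this is only an illustrative assumption:

```bash
# illustrative ifort debug options; see the Intel Fortran compiler manual for details
FFLAGS="-g -O0 -traceback -check all -fpe0 -init=snan,arrays -warn all"
```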
Thank you for an excellent meeting today. Here are the main points. When Tyler ran the system (ww3+scotch) using impi, he had failures similar to the ones that NOAA has had. The system works for sgimpt. So the following options have been suggested as the path forward:
-- We shall focus all attention on the debug build options so we can have adequate traceback options. Aron will provide the debug options we should use for the WW3 build (over the standard debug options that we have)
-- We should not use the cmake build option for scotch, but use one of the make build options that are available, for now, for debugging (again, Aron will direct us to which ones)
-- We shall wait for a newer instrumented version of SCOTCH from Francois (it was not clear if we should use the instrumented version of the code Francois had already provided or if he would provide a newer version)
-- EMC will provide the traceback error location using the impi library, in debug mode, so that we can know exactly where the problems are occurring. EMC will also test with other mpi libraries it has access to
-- Tyler will do the same with his machines. He will also test with the one mpi library that works (sgimpt)
-- Aron will provide options for how we can compile the mpi libraries in debug mode to see if that will provide more information
-- If these options do not provide an indication of where the problem is occurring, we will proceed to more detailed debugging using mpi-barriers and print statements
Thank you and please add anything I missed |
I tested @MatthewMasarik-NOAA's canned case (v7.0.3, not the version just emailed) with intel 18 on hera:
and it failed with:
|
Here is some output from running with the various SCOTCH_NOAA_DEBUG flags. It's likely these need to be re-run with additional compiler flags turned on for SCOTCH to get additional traceback information. All results below use the WW3 default debug cmake options and build SCOTCH with cmake in debug mode, Intel 18 (unintentionally changed from the above test, will re-run with 2022) and impi. Building with:
Building with:
Building with:
|
@JessicaMeixner-NOAA, @MatthewMasarik-NOAA, as we agreed yesterday, I have provided the how-to: building SCOTCH in debug and performance modes, and further instrumentation, is described in #964 in the discussion section, using GNU make. This of course implies that it needs to be applied in combination with #927 from the issue section. |
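For orientation, the traditional GNU-make build of SCOTCH generally follows the pattern sketched below; the exact Makefile.inc template and targets are assumptions on my part, and the authoritative instructions are the ones referenced in #964 and in the SCOTCH source tree:

```bash
cd scotch/src
cp Make.inc/Makefile.inc.x86-64_pc_linux2 Makefile.inc   # then edit CFLAGS for debug
make scotch ptscotch                                      # serial and MPI (PT-SCOTCH) libraries
make install prefix=<path-to>/install
```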
@aronroland Thanks for pointing this out, I missed the other thread with the build info despite looking for it. Happy to switch to these new build instructions and build flags. In the meantime I have some updates from running with Intel 2021 that I'll share, since those runs are in the queue. |
Hi @JessicaMeixner-NOAA, please correct/modify/add/question anything that is not clear, since I really would like to unify everything in such a way that it is understandable for everybody, and this may be difficult for me since I am deep inside of this and maybe I do not explain it in a way that is broadly understandable. Thanks for your help in advance. Saying this, I see that the c-flags in the debugging makefile for the impi part could be further expanded, but I would like to have this in the SCOTCH repo. Therefore I will experiment a bit with this part, adjust with the SCOTCH team, and provide a further expanded debug makefile for the "c" language using the Intel compiler and GNU make for SCOTCH. I think that we can go forward with this, but expect more Monday. I was also not sure if the "idea" section is the right place to put it, but I do not feel that this is like an issue. So feel free to move it anywhere else where you think it is appropriate. Thanks in advance. |
@MatthewMasarik-NOAA asked which compiler flags we should use and when. Thanks for this question. I have extended #927 in order to answer your important question. Please let me know if this helps. |
Here is output from building SCOTCH with CMAKE,
First run with the following:
Error output:
For an example of which flags are used in compilation here's a line from the scotch make output:
and from WW3: Second run with the following:
Error output:
Third run with the following:
Error output:
Fourth run with all 3:
error:
|
Hi all, this is a repost from the email thread in case it was missed. It is a traceback from when I started to test with GNU (meaning no Intel), so is a data point to compare with @JessicaMeixner-NOAA's Intel results just posted. It also shows the assessment of The SCOTCH routine we use is That routine calls It appears to be dying in From the traceback the first intelligible record is for Since the problem seems to be in |
For current efforts, I am working to get the GNU |
Hi @MatthewMasarik-NOAA, sure, let's set up a meeting with @JessicaMeixner-NOAA and check where we are with respect to the work schedule; maybe @thesser1 can join us and we can discuss the actual state of work. |
@aronroland I'm on leave Weds-Friday, so I'll set up a time for today. Should we invite Francois too, as we have some additional output that perhaps he can provide feedback on? |
Sure thing, @aronroland. I'm looking forward to discussing today. |
I am not available at 11:30 today, but I can update you that I was able to set up scotch on my cray computer running with the intel compiler and cray-mpich mpi library. I compiled both scotch and ww3 with debug flags as described and I put the SCOTCH_NOAA_DEBUG_1 and SCOTCH_NOAA_DEBUG_3 flags on during the build process. I was able to run the case smoothly to 4200 cores. When I tried to double it to 8000 cores, the run stalled. I hope to find time today to pull the output of the stalled runs so we can learn what changed.
|
SCOTCH_NOAA_DEBUG + SCOTCH GNU make
Results for running the three NOAA debug flags separately in the instrumented 'noaa2' SCOTCH repo. SCOTCH is built using the traditional/GNU make, along with each flag:
SCOTCH_NOAA_DEBUG_1: noaa-1-scotch-make.out.txt
SCOTCH_NOAA_DEBUG_2: noaa-2-scotch-make.out.txt
SCOTCH_NOAA_DEBUG_3: noaa-3-scotch-make.out.txt
|
I've run the test case building with intel/2022.1.2 and mvapich2/2.3 last week and again today and my job just hangs and I get no output. I believe this is consistent with what @thesser1 reported as well. |
Trying to understand why each of the three tests (
This is with flags
|
Update - Aron's debug flags
Reporting output for runs that use:
For these builds the
SCOTCH_NOAA_DEBUG_1: debug1.scotch.make.out.txt
SCOTCH_NOAA_DEBUG_2: debug2.scotch.make.out.txt
SCOTCH_NOAA_DEBUG_3: debug3.scotch.make.out.txt
The same runs were done with
Q: I've been compiling SCOTCH (both
Putting it together
From these tracebacks we are clearly having a crash in
And from the tracebacks with
with an error message mentioning
Line 1436 in
1434: if (matedat.c.multloctmp != NULL) {             /* If we allocated the multinode array */
1435:   matedat.c.multloctmp =
1436:   matedat.c.multloctab = memRealloc (matedat.c.multloctab, matedat.c.multlocnbr * sizeof (DgraphCoarsenMulti)); /* Resize multinode array */
1437: }
Is |
This issue was resolved within SCOTCH by release 7.0.4. A scotch/7.0.4 module has previously been added to spack-stack/1.5.0 and installed and tested on RDHPCS machines. scotch/7.0.4 has also been installed on WCOSS2. A final check to confirm scalability was done by @JessicaMeixner-NOAA on cactus (~22 Dec 2023) by running WW3 coupled with 6000 PETs for the wave component. |
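For anyone verifying the fix locally, a minimal check that the fixed module is available and loaded; the module names follow the comment above, while the spack-stack modulefile path on a given machine is an assumption left as a placeholder:

```bash
module use <spack-stack-1.5.0-modulefiles-path>   # machine-specific, placeholder
module spider scotch                              # list available scotch versions
module load scotch/7.0.4
```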