
Scaling for unstructured grids using SCOTCH for domain decomposition #879

Closed
MatthewMasarik-NOAA opened this issue Feb 2, 2023 · 131 comments
Labels: bug (Something isn't working)

@MatthewMasarik-NOAA
Collaborator

Describe the bug
Running WW3 with unstructured grids, using the SCOTCH mesh/hypergraph partitioning library for MPI domain decomposition, scales only to ~2K cores (the exact count is grid-size dependent). Above this core count, WW3 fails during model initialization.

This behavior was found during scaling simulations in which the allowable resources are ~8K cores. Experiments for two separate meshes, unst1 (~0.5M nodes) and unst2 (~1.8M nodes), were conducted on hera. I was unable to run the same experiments on another HPC machine (there are ongoing issues with building WW3/SCOTCH on orion, and SCOTCH is currently not available on WCOSS2; those are the machines I have access to).

Note: ParMetis, which is the partitioning library SCOTCH is replacing, was able to scale out to ~8K cores for each of the grids.

To Reproduce

  1. Build SCOTCH
  2. Build WW3 with SCOTCH
  3. Run the executable with core counts (= MPI tasks) > ~2K

Expected behavior
WW3 will error and core dump.

  • SCOTCH build instructions for (Intel) hera
# https://gitlab.inria.fr/scotch/scotch.git

cd scotch

module purge
module load cmake/3.20.1
module load intel/2022.1.2
module load impi/2022.1.2
module use  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
module load hdf5/1.10.6
module load netcdf/4.7.4
module load gnu/9.2.0

mkdir build && cd build
cmake -DCMAKE_Fortran_COMPILER=ifort            \
      -DCMAKE_C_COMPILER=icc                    \
      -DCMAKE_INSTALL_PREFIX=<path-to>/install  \
      -DCMAKE_BUILD_TYPE=Release ..             |& tee cmake.out
make  VERBOSE=1                                 |& tee make.out
make  install

Screenshots

  • hera environment used (job card)
#SBATCH -q batch                  
#SBATCH -t 08:00:00               
#SBATCH --cpus-per-task=1         
#SBATCH -n 2400                   
#SBATCH --exclusive

  module purge                                                                                               
  module load cmake/3.20.1                                                                                   
  module load intel/2022.1.2                                                                                 
  module load impi/2022.1.2                                                                                  
  module use  /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack               
  module load hpc/1.2.0                                                                                      
  module load hpc-intel/2022.1.2                                                                             
  module load hpc-impi/2022.1.2                                                                              
  module load jasper/2.0.25                                                                                  
  module load zlib/1.2.11                                                                                    
  module load libpng/1.6.37                                                                                  
  module load hdf5/1.10.6                                                                                    
  module load netcdf/4.7.4                                                                                   
  module load bacio/2.4.1                                                                                    
  module load g2/3.4.5                                                                                       
  module load w3emc/2.9.2                                                                                    
  module load esmf/8.3.0b09                                                                                  
  module load gnu/9.2.0                                                                                      
  export SCOTCH_PATH=/scratch1/NCEPDEV/climate/Matthew.Masarik/waves/opt/hpc-stack/scotch/install            
                                                                                                             
  ulimit -s unlimited                                                                                        
  ulimit -c 0                                                                                                
  export KMP_STACKSIZE=2G                                                                                    
  export FI_OFI_RXM_BUFFER_SIZE=128000                                                                       
  export FI_OFI_RXM_RX_SIZE=64000                                                                            
                                                                                                             
  export OMP_NUM_THREADS=1  
  • log output / error message
    (screenshot of log output / error message not reproduced here)

  • Results from the two grids mentioned above. Each was run separately using both SCOTCH and ParMetis for decomposition.

    • unst1
      • SCOTCH: scaled to ~1800 cores.
      • ParMetis: scaled through the allowable range, 8K cores.
    • unst2
      • SCOTCH: scaled to ~2200 cores.
      • ParMetis: scaled through the allowable range, 8K cores.
        The plot below shows this behavior.
        (plot: msh2_scal)

Additional context
This stems from current PR #849.

This issue is intended to be a place we can all collect information. @aliabdolali @aronroland please share any information you've learned working on this topic.

TODO

  • For the unstructured meshes, OMP threads cannot be used. However, to test whether this is memory related, I will re-run the experiments with more than one cpu-per-task (still one thread), to provide more memory per task.
  • Another possible detail: the serial versions of the Intel compilers (ifort/icc) are passed to cmake. From the variable names I believe this is correct, though I'm not 100% certain they shouldn't be the MPI wrapper names, mpiifort/mpiicc.
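For the second TODO item, the variant under test would look roughly like the sketch below. This is hypothetical: it simply swaps the Intel MPI wrapper names into the configure step shown earlier, and whether SCOTCH's CMake setup actually expects the wrappers (rather than the serial compilers plus separate MPI detection) is exactly the open question.

```shell
# Hypothetical variant of the cmake configure step above, passing the
# Intel MPI compiler wrappers (mpiifort/mpiicc) instead of serial
# ifort/icc. Untested sketch: only the two -DCMAKE_*_COMPILER values
# differ from the build instructions in this issue.
cmake -DCMAKE_Fortran_COMPILER=mpiifort         \
      -DCMAKE_C_COMPILER=mpiicc                 \
      -DCMAKE_INSTALL_PREFIX=<path-to>/install  \
      -DCMAKE_BUILD_TYPE=Release ..             |& tee cmake.out
```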
@MatthewMasarik-NOAA MatthewMasarik-NOAA added the bug Something isn't working label Feb 2, 2023
@aronroland
Collaborator

aronroland commented Feb 2, 2023 via email

@MatthewMasarik-NOAA
Collaborator Author

Hi Aron,
That is very helpful information. Thank you for sharing your experience. I have those re-runs in the works so I'll report back when I have the results. That would be amazing news if we can fix this just in the job card. Cheers and thanks again for the insight.

@aronroland
Collaborator

aronroland commented Feb 3, 2023 via email

@JessicaMeixner-NOAA
Collaborator

@aronroland we understand if you have other priorities, but this is a high priority item for us and we'll continue to work on this. It'd be great if you could include us (@MatthewMasarik-NOAA and myself) on your conversations with the SCOTCH developers on this issue.

@MatthewMasarik-NOAA
Collaborator Author

I discussed also with Ali and I think the 1st step should be to check with your admins and memcheck on that.

Hi Aron, I was able to get some runs in yesterday and will be able to share the results here this afternoon. Please stay tuned. Thanks

@aronroland
Collaborator

aronroland commented Feb 3, 2023 via email

@MatthewMasarik-NOAA
Collaborator Author

Following @aronroland's initial comments above, I performed some more scaling runs in which I added cores per task to increase the available memory (the OMP thread count stays at 1 for unstructured grids, so cores != tasks x threads for the new cases). The new cases are cores-per-task = 2 and 4; these are the only cases that made sense to me. cores-per-task=1 has already been run, and cores-per-task=6 (or above) would eat up too many cores for memory alone, leaving the corresponding task count too low for performance.

I attempted to run total core counts between 1K and 8K. All 4-core runs completed. In the 2-core runs, the model crashed above ~4K cores, in the same manner as before. The table gives the parameters for the highest scaling / best performance for each cores-per-task case.

cores-per-task   max total cores   MPI tasks   min runtime/sim day
1                2200              2200        992 s (~17 min)
2                4000              2000        933 s (~16 min)
4                8000              2000        903 s (~15 min)
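For concreteness, the cores-per-task=4 row corresponds to a job-card sketch like the following (values taken from the table; everything else is assumed to match the hera job card shown earlier in this issue):

```shell
#SBATCH -n 2000               # MPI tasks
#SBATCH --cpus-per-task=4     # cores reserved per task: 2000 x 4 = 8000 cores
#SBATCH --exclusive

export OMP_NUM_THREADS=1      # threads stay at 1 for the unstructured grids
```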

(plot: scal_scotch_mem_runs)

@aronroland
Collaborator

aronroland commented Feb 3, 2023 via email

@MatthewMasarik-NOAA
Collaborator Author

Hi @aronroland, quick clarification: I only performed runs with total core requests up to 8K because this is our rough upper limit of projected resources for GFSv17. Behavior past 8K cores for cores-per-task=4 is unknown.

@arunchawla-NOAA
Contributor

@MatthewMasarik-NOAA can you go past the 8k cores with cores-per-task=4 and see how far you can push this?

@MatthewMasarik-NOAA
Collaborator Author

@arunchawla-NOAA yes, I would like to try this. The max core request in hera's batch queue is 8400, so that is as far out as I've been able to run. With 4 cores per task that gives 2100 MPI tasks, which is still close enough to the ~2K MPI task mark, and that run succeeded. I am going to be working more on acorn tomorrow, but I believe you mentioned that machine has just 4096 cores.

So to push past 8K cores with cores-per-task=4, I think I would need to submit a request to run in hera's novel queue. Is this something I should look into, or is there another avenue for running at core counts that high?

@arunchawla-NOAA
Contributor

Thanks Matt. A few things:

  1. I am curious how many cores it takes to bring the wave model timing down, so it will be good to know how far you can extend this.
  2. So at cores-per-task=4 you are able to get past the problems you were having with the library crashing? I still do not understand what the issue with the 2000 MPI tasks is.
  3. It will be good to see if you can install the library on acorn and run the model, so we can make sure the problem system is only Orion. If we cannot solve that problem on Orion, we will have to reach out to the help desk.

@aliabdolali
Contributor

@arunchawla-NOAA @MatthewMasarik-NOAA
Here are my thoughts:

  • We need to compile SCOTCH with debug flags, as consistent as possible with the Debug flags in WW3, and try it. We can then report the outcomes of our debug simulations to the SCOTCH developers.
  • I am not sure about the practical difference between ntasks-per-node and cores-per-task, but in my experience after years of using unstructured WW3 at various scales and setup sizes, ntasks-per-node=20 or 30 out of 40 cores/node can help provide more memory per core if memory is an issue. However, I do not think our problem is a memory issue.
  • Testing on a different platform like acorn would be beneficial.

@JessicaMeixner-NOAA
Collaborator

@aliabdolali you mentioned 2-3 weeks ago you'd be looking into checking the decomposition and running with more diagnostic output. Any update on that?

@MatthewMasarik-NOAA I'm still not sure what running with the novel queue will tell us that we can't find out other ways, but there's no harm in making that request and doing that run. For the runs that are completing, it'd be interesting to see the memory usage (perhaps comparing against ParMetis memory usage), since memory still seems to be a theory. I think I lost this, but what happens if we run with 2 threads and more than 2,000 MPI tasks? Or does that also need the novel queue?

@aronroland
Collaborator

aronroland commented Feb 9, 2023 via email

@aronroland
Collaborator

aronroland commented Feb 9, 2023 via email

@aliabdolali
Contributor

@aronroland thanks for pushing it further to SCOTCH developers.

@MatthewMasarik-NOAA
Collaborator Author

@arunchawla-NOAA @JessicaMeixner-NOAA @aliabdolali @aronroland,

Running in debug mode on orion is underway. Getting SCOTCH built on acorn and running a canned case there is also highest priority. Regarding the MPI tasks, I want to clarify that I've just referred to 2K as roughly where runs start to die. In the case of 4 cores-per-task, I initially ran 8K cores with 2K tasks, and that was successful. I found hera's limit of 8400 cores when I wanted to see how far it could be pushed. I did one 4-core run at the 8400 limit (2100 tasks), and that was successful. Since other runs had made it to 2200 MPI tasks before dying, I was not surprised; that's what I was referring to previously.

I think I lost this, but what happens if we run with 2 threads and more than 2,000mpi tasks?

In this case (2 cores), the runs crash. The next increment of resources I tried with 2 cores was 2,200 MPI tasks. This count (and several other increments between 2200 and 4000 MPI tasks) all crashed.

@aronroland
Collaborator

@MatthewMasarik-NOAA, @JessicaMeixner-NOAA,

is there any news on the run using the debug flags?

@MatthewMasarik-NOAA
Collaborator Author

@aronroland I have been digging into it. I can give an update early afternoon.

@JessicaMeixner-NOAA
Collaborator

@aronroland - no news from me; I'm working on debugging the SCOTCH build on orion, which is the blocking issue for the PR. Any news from you?

@aronroland
Collaborator

@JessicaMeixner-NOAA as I said, I cannot run on that many cores since I do not have access to more than 448 cores. I am waiting for the debug output so we can see the nature of the problem. Is there anything I missed that I should do? By the way, before I forget: it would be great if SCOTCH and WW3 could be built with the same debug flags, and if we could have a SCOTCH build for debugging. This would be really helpful. Thanks for your hard work on this issue.

@JessicaMeixner-NOAA
Collaborator

@aronroland do you think it'd be potentially interesting/useful to compare the memory usage between SCOTCH and ParMetis, even on smaller node counts, to see if it's vastly different even for a smaller number of cores? I still haven't heard anything from @aliabdolali, who was going to look into the decomposition and run with extra output, which could give us more information as well.

@aliabdolali
Contributor

We need SCOTCH outputs with debug enabled, so we can ask the SCOTCH developers to take a look. I think Aron and I asked for this days ago; I'd appreciate it if you could do it at your earliest convenience, and then we can continue.
I do not believe this is a memory issue, but any info would be helpful.

@aronroland
Collaborator

That would clearly be the next step, but honestly, before we have the debug/traceback it is fishing in the dark. Separately, I had some memory issues with OASIS, and the sysadmin at DATARMOR was kind enough to give hints on memory usage; could you check with your sysadmin whether he can tell you something about that when looking at the job ID? Once you have the debug build, I will go on with the memory examination.

@aronroland
Collaborator

aronroland commented Feb 10, 2023

@JessicaMeixner-NOAA I was thinking more about the memory issue: we basically cannot see this without modifying the SCOTCH code. Without deep insight into SCOTCH's memory management, it will not be helpful to look at WW3, since for WW3 the memory does not depend on SCOTCH or ParMetis.

@MatthewMasarik-NOAA
Collaborator Author

It would be great if you could make a list of those flags and explain for each of them why they are used, and I am curious whether you have tried the flags I sent you? It would be great for us developers to understand your environment at NOAA.

Hi @aronroland, as far as a list of the flags goes, they are just as they appear in those two lines of the CMakeLists.txt file. It may be that we need to look into the flags we are using, though for now, so we don't get sidetracked: I understood from Ali (meeting last Thu) that the run done at ERDC out to 8K cores used the standard WW3 cmake compile, i.e. the flags listed there.

@aronroland
Collaborator


@MatthewMasarik-NOAA about flags used at ERDC u should discuss with @thesser1.

@MatthewMasarik-NOAA
Collaborator Author

@aronroland, from my conversation with @aliabdolali last Thu, he stated that the standard flags had been used for the particular run in question at ERDC. Do you believe different flags were used? @aliabdolali @thesser1, can either of you confirm whether the standard WW3 compile options were used for the 8K core run?

@aliabdolali
Contributor

From what I recall, Ty compiled SCOTCH the same way I did initially and tested WW3 with its release flags, but I'll leave it to him to confirm.
I usually use Aron's flags during development and debugging, as the WW3 standard flags (including debug) usually do not provide insightful info.

@thesser1
Collaborator

thesser1 commented Mar 21, 2023 via email

@aronroland
Collaborator


That reply just slipped past me. Let me be frank, Matt: you must know exactly which flags are used and why they are used. I think some of the flags that are used, like the i4 / 32-bit stuff, are a very bad choice, most likely used to make the model b4b. We will now check whether it is b4b using other flags.

@MatthewMasarik-NOAA
Collaborator Author


@aronroland, I wholeheartedly agree with your sentiment we should know what flags we are using and why.

I've tried to run some tests using the Intel debug flags you gave in your post, though I'm running into problems with the compile. Here's what I tried. First I used just your flags, removing the standard + debug flags and replacing them with yours: the compile failed. Next I put the standard Intel flags back in and replaced only the debug flags with yours: this compile also failed. The Intel compiler doesn't seem to recognize some of the flags, so I am going to try removing those until the compile succeeds. I'll keep you posted. Have you and Ali successfully compiled with these flags on any NOAA machines?

@aronroland
Collaborator

aronroland commented Mar 25, 2023

Matt, "the compile failed": can you please be more specific? You can just paste here everything you got. I always use these flags because I know why I am using them and what I am using them for. These flags were not developed by me; they are provided by Intel with a clear purpose (see the Intel Fortran compiler manual). This I know precisely, and therefore I am able to interpret the results decently. I would very much appreciate it if you would share any kind of compiler problems or bugs from stdout, and my warm suggestion is not to go forward as long as we have compilation issues. I wish you would be willing to share this stdout with us; let me thank you in advance for your precious work. By the way, the NOAA machine is nothing special: it uses Intel CPUs and Mellanox or other IB networks, so there is no magic in "NOAA" being particular with respect to the hardware infrastructure.

@aliabdolali aliabdolali removed their assignment Mar 27, 2023
@arunchawla-NOAA
Contributor

Thank you for an excellent meeting today. Here are the main points

When Tyler ran the system (WW3 + SCOTCH) using impi, he had failures similar to the ones that NOAA has had. The system works with sgimpt. So the following options have been suggested as the path forward:

-- We shall focus all attention on the debug build options so we can have adequate traceback. Aron will provide the debug options we should use for the WW3 build (beyond the standard debug options that we have)

-- We should not use the cmake build for SCOTCH, but one of the make builds that are available, for now, for debugging (again, Aron will direct us to which ones)

-- We shall wait for a newer instrumented version of SCOTCH from Francois (it was not clear whether we should use the instrumented version of the code Francois had already provided, or whether he will provide a newer version)

-- EMC will provide the traceback error location using the impi library in debug mode, so that we know exactly where the problems are occurring. EMC will also test with the other MPI libraries it has access to

-- Tyler will do the same on his machines. He will also test with the one MPI library that works (sgimpt)

-- Aron will provide options for how we can compile the MPI libraries in debug mode, to see if that provides more information

-- If these options do not give an indication of where the problem is occurring, we will proceed to more detailed debugging using MPI barriers and print statements

Thank you and please add anything I missed

@JessicaMeixner-NOAA
Collaborator

I tested @MatthewMasarik-NOAA's canned case (v7.0.3, not the version just emailed) with Intel 18 on hera:

module use /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/modulefiles/stack
module load hpc/1.1.0
module load hpc-intel/18.0.5.274
module load hpc-impi/2018.0.4

and it failed with:

Fatal error in MPI_Irecv: Message truncated, error stack:
MPI_Irecv(170)......................: MPI_Irecv(buf=0x33f6ad8, count=2, MPI_INT, src=702, tag=300, MPI_COMM_WORLD, request=0x3067bf4) failed
MPIDI_CH3U_Request_unpack_uebuf(618): Message truncated; 296 bytes received but buffer size is 8
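As a side note on reading this error: the message reports bytes, and converting to MPI_INT elements (assuming a 4-byte MPI_INT, consistent with the "buffer size is 8" for count=2) makes the mismatch concrete; the receive was posted for 2 elements while the matching send carried many more:

```shell
# "Message truncated; 296 bytes received but buffer size is 8":
# the receive was posted with count=2 MPI_INT (2 x 4 = 8 bytes), but the
# matching send on the same (src, tag) carried 296 bytes, i.e. 74 elements.
mpi_int_size=4
posted_count=2
received_bytes=296
sent_count=$((received_bytes / mpi_int_size))
echo "posted=$posted_count elements, sent=$sent_count elements"
```

In other words, the sender and receiver disagree on the message length, which points at inconsistent exchange sizes computed on the two ranks rather than at memory exhaustion.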

@JessicaMeixner-NOAA
Collaborator

JessicaMeixner-NOAA commented Mar 31, 2023

Here is some output from running with the various SCOTCH_NOAA_DEBUG flags. These likely need to be re-run with additional compiler flags turned on for SCOTCH to get more traceback information. All results below use the WW3 default debug cmake options and build SCOTCH with cmake in debug mode, with Intel 18 (unintentionally changed from the test above; will re-run with 2022) and impi.

Building with:
-DSCOTCH_NOAA_DEBU1=ON
Error:

Invalid count 1
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ww3_shel           000000000154300D  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B6AAE92E630  Unknown               Unknown  Unknown
ww3_shel           00000000014E2159  Unknown               Unknown  Unknown
ww3_shel           00000000014D2C26  Unknown               Unknown  Unknown
ww3_shel           00000000014C8517  Unknown               Unknown  Unknown
ww3_shel           00000000014CA7A4  Unknown               Unknown  Unknown
ww3_shel           00000000014CA9B6  Unknown               Unknown  Unknown
ww3_shel           00000000014C1166  Unknown               Unknown  Unknown
ww3_shel           00000000014C2F93  Unknown               Unknown  Unknown
ww3_shel           00000000014C32E6  Unknown               Unknown  Unknown
ww3_shel           00000000014C1F08  Unknown               Unknown  Unknown
ww3_shel           00000000014C0B8F  Unknown               Unknown  Unknown
ww3_shel           00000000014BCFDE  Unknown               Unknown  Unknown
ww3_shel           00000000014BC1B2  Unknown               Unknown  Unknown
ww3_shel           00000000014BC33B  Unknown               Unknown  Unknown
ww3_shel           00000000014BB5A0  Unknown               Unknown  Unknown
ww3_shel           00000000014BB3F3  Unknown               Unknown  Unknown
ww3_shel           00000000012AC206  yowpdlibmain_mp_r         632  yowpdlibmain.F90
ww3_shel           0000000001299D9A  yowpdlibmain_mp_i         127  yowpdlibmain.F90
ww3_shel           00000000010FAEA4  pdlib_w3profsmd_m         265  w3profsmd_pdlib.F90
ww3_shel           000000000089AB7D  w3initmd_mp_w3ini         750  w3initmd.F90
ww3_shel           0000000000445E31  MAIN__                   1903  ww3_shel.F90

Building with:
-DSCOTCH_NOAA_DEBUG2=ON
Error:

       Wave model ...
Fatal error in MPI_Irecv: Message truncated, error stack:
MPI_Irecv(170)......................: MPI_Irecv(buf=0x2bd2c80, count=20, MPI_INT, src=703, tag=300, MPI_COMM_WORLD, request=0x28439f8) failed
MPIDI_CH3U_Request_unpack_uebuf(618): Message truncated; 280 bytes received but buffer size is 80

Building with:
-DSCOTCH_NOAA_DEBUG_2=ON
Error:
Building with:
-DSCOTCH_NOAA_DEBUG_3=ON
Error:

       Wave model ...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ww3_shel           0000000001542FAD  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B2AF0D4C630  Unknown               Unknown  Unknown
ww3_shel           00000000014E1B2F  Unknown               Unknown  Unknown
ww3_shel           00000000014D2C30  Unknown               Unknown  Unknown
ww3_shel           00000000014C8517  Unknown               Unknown  Unknown
ww3_shel           00000000014CA7A4  Unknown               Unknown  Unknown
ww3_shel           00000000014CA9B6  Unknown               Unknown  Unknown
ww3_shel           00000000014C1166  Unknown               Unknown  Unknown
ww3_shel           00000000014C2F93  Unknown               Unknown  Unknown
ww3_shel           00000000014C32E6  Unknown               Unknown  Unknown
ww3_shel           00000000014C1F08  Unknown               Unknown  Unknown
ww3_shel           00000000014C0B8F  Unknown               Unknown  Unknown
ww3_shel           00000000014BCFDE  Unknown               Unknown  Unknown
ww3_shel           00000000014BC1B2  Unknown               Unknown  Unknown
ww3_shel           00000000014BC33B  Unknown               Unknown  Unknown
ww3_shel           00000000014BB5A0  Unknown               Unknown  Unknown
ww3_shel           00000000014BB3F3  Unknown               Unknown  Unknown
ww3_shel           00000000012AC206  yowpdlibmain_mp_r         632  yowpdlibmain.F90
ww3_shel           0000000001299D9A  yowpdlibmain_mp_i         127  yowpdlibmain.F90
ww3_shel           00000000010FAEA4  pdlib_w3profsmd_m         265  w3profsmd_pdlib.F90
ww3_shel           000000000089AB7D  w3initmd_mp_w3ini         750  w3initmd.F90
ww3_shel           0000000000445E31  MAIN__                   1903  ww3_shel.F90

@aronroland
Collaborator

aronroland commented Mar 31, 2023

@JessicaMeixner-NOAA, @MatthewMasarik-NOAA, as we agreed yesterday, I have provided the how-to for building SCOTCH in debug and performance modes; this, plus further instrumentation, is given in #964 in the discussion section, using gnu make. This of course needs to be applied in combination with #927 from the issue section.

@JessicaMeixner-NOAA
Collaborator

@aronroland Thanks for pointing this out; I missed the other thread with the build info despite looking for it. Happy to switch to these new build instructions and build flags. In the meantime I have some updates from running with Intel 2021 that I'll share, since those runs are in the queue.

@aronroland
Collaborator

aronroland commented Mar 31, 2023


Hi @JessicaMeixner-NOAA, please correct/modify/add/question anything that is not clear, since I really want to unify everything in such a way that it is understandable for everybody. This may be difficult for me, since I am deep inside of this and may not explain it in a broadly understandable way. Thanks for your help in advance.

That said, I see that the C flags in the debugging makefile for the impi part could be further expanded, but I would like to have this in the SCOTCH repo. Therefore I will experiment a bit with this part, coordinate with the SCOTCH team, and provide a further expanded debug makefile for the C language using the Intel compiler and gnu make for SCOTCH. I think we can go forward with this, but expect more on Monday.

I was also not sure whether the "ideas" section is the right place to put it, but it does not feel like an issue. So feel free to move it anywhere else you think is appropriate. Thanks in advance.

@aronroland
Collaborator

@MatthewMasarik-NOAA asked which compiler flags we should use and when. Thanks for this question. I have extended #927 in order to answer it. Please let me know if this helps.

@JessicaMeixner-NOAA
Collaborator

JessicaMeixner-NOAA commented Apr 3, 2023

Here is output from building SCOTCH with CMAKE,

intel/2022.1.2
impi/2022.1.2

First run with the following:

export CFLAGS="-DSCOTCH_NOAA_DEBUG_1"
export CPPFLAGS="-DSCOTCH_NOAA_DEBUG_1"
export CXXFLAGS="-DSCOTCH_NOAA_DEBUG_1"

Error output:

       Wave model ...
[h25c44:93497:0:93497] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x30e2bf0)
==== backtrace (tid:  93497) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000155cde __memcpy_ssse3_back()  :0
 2 0x000000000003581b ucp_tag_recv_nb()  ???:0
 3 0x000000000000bdbb mlx_tagged_recv()  mlx_tagged.c:0
 4 0x0000000000567b85 fi_trecv()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_tagged.h:91
 5 0x0000000000567b85 MPIDI_OFI_do_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_recv.h:127
 6 0x0000000000567b85 MPIDI_NM_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:377
 7 0x0000000000567b85 MPIDI_irecv_handoff()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:81
 8 0x0000000000567b85 MPIDI_irecv_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:238
 9 0x0000000000567b85 MPIDI_irecv_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
10 0x0000000000567b85 MPID_Irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
11 0x0000000000567b85 PMPI_Irecv()  /build/impi/_buildspace/release/../../src/mpi/pt2pt/irecv.c:139
12 0x00000000014c993c _SCOTCHdgraphMatchSyncPtop()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/dgraph_match_sync_ptop.c:204
13 0x00000000014ba31a _SCOTCHdgraphCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/dgraph_coarsen.c:1377
14 0x00000000014afbcb bdgraphBipartMlCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:114
15 0x00000000014b1e77 bdgraphBipartMl2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:775
16 0x00000000014b2089 _SCOTCHbdgraphBipartMl()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:826
17 0x00000000014a880a _SCOTCHbdgraphBipartSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_st.c:377
18 0x00000000014aa63a kdgraphMapRbPart2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:387
19 0x00000000014aa98d _SCOTCHkdgraphMapRbPart()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:437
20 0x00000000014a95ac _SCOTCHkdgraphMapRb()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb.c:261
21 0x00000000014a8233 _SCOTCHkdgraphMapSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_st.c:186
22 0x00000000014a467e SCOTCH_dgraphMapCompute()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/library_dgraph_map.c:191
23 0x00000000014a3852 SCOTCH_ParMETIS_V3_PartKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:182
24 0x00000000014a39db SCOTCH_ParMETIS_V3_PartGeomKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:230
25 0x00000000014a2c40 SCOTCH_PARMETIS_V3_PARTGEOMKWAY()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:118
26 0x00000000014a2a93 scotch_parmetis_v3_partgeomkway_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:96
27 0x0000000001291743 yowpdlibmain_mp_runparmetis_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/model/src/PDLIB/yowpdlibmain.F90:632
28 0x000000000127f541 yowpdlibmain_mp_initfromgriddim_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/model/src/PDLIB/yowpdlibmain.F90:127
29 0x00000000010dbe3d pdlib_w3profsmd_mp_pdlib_init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/model/src/w3profsmd_pdlib.F90:265
30 0x00000000008a02e6 w3initmd_mp_w3init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/model/src/w3initmd.F90:750
31 0x0000000000447f36 MAIN__()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/model/src/ww3_shel.F90:1903
32 0x0000000000407ea2 main()  ???:0
33 0x0000000000022555 __libc_start_main()  ???:0
34 0x0000000000407da9 _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ww3_shel           000000000152427A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B2373B33630  Unknown               Unknown  Unknown
libc-2.17.so       00002B2373E95CDE  Unknown               Unknown  Unknown
libucp.so.0.0.0    00002B24C0BA981B  ucp_tag_recv_nb       Unknown  Unknown
libmlx-fi.so       00002B24C093BDBB  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B237264EB85  MPI_Irecv             Unknown  Unknown
ww3_shel           00000000014C993C  Unknown               Unknown  Unknown
ww3_shel           00000000014BA31A  Unknown               Unknown  Unknown
ww3_shel           00000000014AFBCB  Unknown               Unknown  Unknown
ww3_shel           00000000014B1E77  Unknown               Unknown  Unknown
ww3_shel           00000000014B2089  Unknown               Unknown  Unknown
ww3_shel           00000000014A880A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA63A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA98D  Unknown               Unknown  Unknown
ww3_shel           00000000014A95AC  Unknown               Unknown  Unknown
ww3_shel           00000000014A8233  Unknown               Unknown  Unknown
ww3_shel           00000000014A467E  Unknown               Unknown  Unknown
ww3_shel           00000000014A3852  Unknown               Unknown  Unknown
ww3_shel           00000000014A39DB  Unknown               Unknown  Unknown
ww3_shel           00000000014A2C40  Unknown               Unknown  Unknown
ww3_shel           00000000014A2A93  Unknown               Unknown  Unknown
ww3_shel           0000000001291743  yowpdlibmain_mp_r         632  yowpdlibmain.F90
ww3_shel           000000000127F541  yowpdlibmain_mp_i         127  yowpdlibmain.F90
ww3_shel           00000000010DBE3D  pdlib_w3profsmd_m         265  w3profsmd_pdlib.F90
ww3_shel           00000000008A02E6  w3initmd_mp_w3ini         750  w3initmd.F90
ww3_shel           0000000000447F36  MAIN__                   1903  ww3_shel.F90
ww3_shel           0000000000407EA2  Unknown               Unknown  Unknown
libc-2.17.so       00002B2373D62555  __libc_start_main     Unknown  Unknown
ww3_shel           0000000000407DA9  Unknown               Unknown  Unknown

For an example of which flags are used during compilation, here's a line from the SCOTCH make output:

cd /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/build/src/libscotch && /apps/oneapi/mpi/2021.5.1/bin/mpiicc -DCOMMON_FILE_COMPRESS_BZ2 -DCOMMON_FILE_COMPRESS_GZ -DCOMMON_FILE_COMPRESS_LZMA -DCOMMON_RANDOM_FIXED_SEED -DSCOTCH_DEBUG_LIBRARY1 -DSCOTCH_PATCHLEVEL_NUM=3 -DSCOTCH_RELEASE_NUM=0 -DSCOTCH_RENAME -DSCOTCH_VERSION_NUM=7 -Drestrict=__restrict -I/scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch -I/scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/build/src/libscotch -I/scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/build/src/include -DSCOTCH_NOAA_DEBUG_1 -g -MD -MT src/libscotch/CMakeFiles/scotch.dir/bgraph_bipart_gg.c.o -MF CMakeFiles/scotch.dir/bgraph_bipart_gg.c.o.d -o CMakeFiles/scotch.dir/bgraph_bipart_gg.c.o -c /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/src/libscotch/bgraph_bipart_gg.c

and from WW3:
cd /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/build/model/src && /apps/oneapi/mpi/2021.5.1/bin/mpiifort -DENDIANNESS="'big_endian'" -DW3_BS0 -DW3_BT1 -DW3_CRT1 -DW3_CRX1 -DW3_DB1 -DW3_DIST -DW3_FLD2 -DW3_FLX0 -DW3_IC0 -DW3_IS0 -DW3_MLIM -DW3_MPI -DW3_NL1 -DW3_NOGRB -DW3_O0 -DW3_O1 -DW3_O14 -DW3_O15 -DW3_O2 -DW3_O3 -DW3_O4 -DW3_O5 -DW3_O6 -DW3_O7 -DW3_PDLIB -DW3_PR3 -DW3_REF0 -DW3_RWND -DW3_SCOTCH -DW3_SEED -DW3_ST4 -DW3_STAB0 -DW3_TR0 -DW3_UQ -DW3_WNT1 -DW3_WNX1 -I/scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/build/model/src/mod -I/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/intel-2022.1.2/impi-2022.1.2/netcdf/4.7.4/include -I/scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/scotch-7.0.3noaa2/install/include -g -module mod -no-fma -ip -g -traceback -i4 -real-size 32 -fp-model precise -assume byterecl -fno-alias -fno-fnalias -O0 -debug all -warn all -check all -check noarg_temp_created -fp-stack-check -heap-arrays -fpe0 -c /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_1/WW3/model/src/PDLIB/yowdatapool.F90 -o CMakeFiles/ww3_lib.dir/PDLIB/yowdatapool.F90.o

Second run with the following:

export CFLAGS="-DSCOTCH_NOAA_DEBUG_2"
export CPPFLAGS="-DSCOTCH_NOAA_DEBUG_2"
export CXXFLAGS="-DSCOTCH_NOAA_DEBUG_2"

Error output:

       Wave model ...
[h36m10:264526:0:264526] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x315a8c0)
==== backtrace (tid: 264526) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000155cde __memcpy_ssse3_back()  :0
 2 0x000000000003581b ucp_tag_recv_nb()  ???:0
 3 0x000000000000bdbb mlx_tagged_recv()  mlx_tagged.c:0
 4 0x0000000000567b85 fi_trecv()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_tagged.h:91
 5 0x0000000000567b85 MPIDI_OFI_do_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_recv.h:127
 6 0x0000000000567b85 MPIDI_NM_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:377
 7 0x0000000000567b85 MPIDI_irecv_handoff()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:81
 8 0x0000000000567b85 MPIDI_irecv_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:238
 9 0x0000000000567b85 MPIDI_irecv_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
10 0x0000000000567b85 MPID_Irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
11 0x0000000000567b85 PMPI_Irecv()  /build/impi/_buildspace/release/../../src/mpi/pt2pt/irecv.c:139
12 0x00000000014c9911 _SCOTCHdgraphMatchSyncPtop()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/dgraph_match_sync_ptop.c:204
13 0x00000000014ba324 _SCOTCHdgraphCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/dgraph_coarsen.c:1377
14 0x00000000014afbcb bdgraphBipartMlCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:114
15 0x00000000014b1e77 bdgraphBipartMl2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:775
16 0x00000000014b2089 _SCOTCHbdgraphBipartMl()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:826
17 0x00000000014a880a _SCOTCHbdgraphBipartSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_st.c:377
18 0x00000000014aa63a kdgraphMapRbPart2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:387
19 0x00000000014aa98d _SCOTCHkdgraphMapRbPart()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:437
20 0x00000000014a95ac _SCOTCHkdgraphMapRb()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb.c:261
21 0x00000000014a8233 _SCOTCHkdgraphMapSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_st.c:186
22 0x00000000014a467e SCOTCH_dgraphMapCompute()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotch/library_dgraph_map.c:191
23 0x00000000014a3852 SCOTCH_ParMETIS_V3_PartKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:182
24 0x00000000014a39db SCOTCH_ParMETIS_V3_PartGeomKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:230
25 0x00000000014a2c40 SCOTCH_PARMETIS_V3_PARTGEOMKWAY()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:118
26 0x00000000014a2a93 scotch_parmetis_v3_partgeomkway_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:96
27 0x0000000001291743 yowpdlibmain_mp_runparmetis_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/WW3/model/src/PDLIB/yowpdlibmain.F90:632
28 0x000000000127f541 yowpdlibmain_mp_initfromgriddim_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/WW3/model/src/PDLIB/yowpdlibmain.F90:127
29 0x00000000010dbe3d pdlib_w3profsmd_mp_pdlib_init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/WW3/model/src/w3profsmd_pdlib.F90:265
30 0x00000000008a02e6 w3initmd_mp_w3init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/WW3/model/src/w3initmd.F90:750
31 0x0000000000447f36 MAIN__()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_2/WW3/model/src/ww3_shel.F90:1903
32 0x0000000000407ea2 main()  ???:0
33 0x0000000000022555 __libc_start_main()  ???:0
34 0x0000000000407da9 _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ww3_shel           000000000152421A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B1ED3F00630  Unknown               Unknown  Unknown
libc-2.17.so       00002B1ED4262CDE  Unknown               Unknown  Unknown
libucp.so.0.0.0    00002B2020F7681B  ucp_tag_recv_nb       Unknown  Unknown
libmlx-fi.so       00002B2020D08DBB  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B1ED2A1BB85  MPI_Irecv             Unknown  Unknown
ww3_shel           00000000014C9911  Unknown               Unknown  Unknown
ww3_shel           00000000014BA324  Unknown               Unknown  Unknown
ww3_shel           00000000014AFBCB  Unknown               Unknown  Unknown
ww3_shel           00000000014B1E77  Unknown               Unknown  Unknown
ww3_shel           00000000014B2089  Unknown               Unknown  Unknown
ww3_shel           00000000014A880A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA63A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA98D  Unknown               Unknown  Unknown
ww3_shel           00000000014A95AC  Unknown               Unknown  Unknown
ww3_shel           00000000014A8233  Unknown               Unknown  Unknown
ww3_shel           00000000014A467E  Unknown               Unknown  Unknown
ww3_shel           00000000014A3852  Unknown               Unknown  Unknown
ww3_shel           00000000014A39DB  Unknown               Unknown  Unknown
ww3_shel           00000000014A2C40  Unknown               Unknown  Unknown
ww3_shel           00000000014A2A93  Unknown               Unknown  Unknown
ww3_shel           0000000001291743  yowpdlibmain_mp_r         632  yowpdlibmain.F90
ww3_shel           000000000127F541  yowpdlibmain_mp_i         127  yowpdlibmain.F90
ww3_shel           00000000010DBE3D  pdlib_w3profsmd_m         265  w3profsmd_pdlib.F90
ww3_shel           00000000008A02E6  w3initmd_mp_w3ini         750  w3initmd.F90
ww3_shel           0000000000447F36  MAIN__                   1903  ww3_shel.F90
ww3_shel           0000000000407EA2  Unknown               Unknown  Unknown
libc-2.17.so       00002B1ED412F555  __libc_start_main     Unknown  Unknown
ww3_shel           0000000000407DA9  Unknown               Unknown  Unknown
srun: error: h36m10: task 2799: Exited with exit code 174

Third run with the following:

export CFLAGS="-DSCOTCH_NOAA_DEBUG_3"
export CPPFLAGS="-DSCOTCH_NOAA_DEBUG_3"
export CXXFLAGS="-DSCOTCH_NOAA_DEBUG_3"

Error output:

       Wave model ...
[h36m52:271096:0:271096] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x44e0640)
==== backtrace (tid: 271096) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000155c4b __memcpy_ssse3_back()  :0
 2 0x000000000003581b ucp_tag_recv_nb()  ???:0
 3 0x000000000000bdbb mlx_tagged_recv()  mlx_tagged.c:0
 4 0x0000000000404320 fi_trecv()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_tagged.h:91
 5 0x0000000000404320 MPIDI_OFI_do_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_recv.h:127
 6 0x0000000000404320 MPIDI_NM_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:377
 7 0x0000000000404320 MPIDI_irecv_handoff()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:81
 8 0x0000000000404320 MPIDI_irecv_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:238
 9 0x0000000000404320 MPIDI_irecv_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
10 0x0000000000404320 MPID_Irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
11 0x0000000000404320 MPIC_Irecv()  /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
12 0x000000000014133b MPIR_Alltoallv_intra_scattered_impl()  /build/impi/_buildspace/release/../../src/mpi/coll/intel/alltoallv/alltoallv_intra_scattered.c:186
13 0x00000000001927a8 MPIDI_NM_mpi_alltoallv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:643
14 0x00000000001927a8 MPIDI_Alltoallv_intra_composition_alpha()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:1794
15 0x00000000001927a8 MPID_Alltoallv_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:2276
16 0x00000000001927a8 MPIDI_coll_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3335
17 0x00000000001717ec MPIDI_coll_select()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130
18 0x00000000002b44df MPID_Alltoallv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:240
19 0x0000000000142405 PMPI_Alltoallv()  /build/impi/_buildspace/release/../../src/mpi/coll/alltoallv/alltoallv.c:351
20 0x00000000014c8524 _SCOTCHdgraphMatchSyncColl()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/dgraph_match_sync_coll.c:210
21 0x00000000014ba324 _SCOTCHdgraphCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/dgraph_coarsen.c:1377
22 0x00000000014afbcb bdgraphBipartMlCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:114
23 0x00000000014b1e77 bdgraphBipartMl2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:775
24 0x00000000014b2089 _SCOTCHbdgraphBipartMl()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:826
25 0x00000000014a880a _SCOTCHbdgraphBipartSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_st.c:377
26 0x00000000014aa63a kdgraphMapRbPart2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:387
27 0x00000000014aa98d _SCOTCHkdgraphMapRbPart()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:437
28 0x00000000014a95ac _SCOTCHkdgraphMapRb()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb.c:261
29 0x00000000014a8233 _SCOTCHkdgraphMapSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_st.c:186
30 0x00000000014a467e SCOTCH_dgraphMapCompute()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotch/library_dgraph_map.c:191
31 0x00000000014a3852 SCOTCH_ParMETIS_V3_PartKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:182
32 0x00000000014a39db SCOTCH_ParMETIS_V3_PartGeomKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:230
33 0x00000000014a2c40 SCOTCH_PARMETIS_V3_PARTGEOMKWAY()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:118
34 0x00000000014a2a93 scotch_parmetis_v3_partgeomkway_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:96
35 0x0000000001291743 yowpdlibmain_mp_runparmetis_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/WW3/model/src/PDLIB/yowpdlibmain.F90:632
36 0x000000000127f541 yowpdlibmain_mp_initfromgriddim_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/WW3/model/src/PDLIB/yowpdlibmain.F90:127
37 0x00000000010dbe3d pdlib_w3profsmd_mp_pdlib_init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/WW3/model/src/w3profsmd_pdlib.F90:265
38 0x00000000008a02e6 w3initmd_mp_w3init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/WW3/model/src/w3initmd.F90:750
39 0x0000000000447f36 MAIN__()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_3/WW3/model/src/ww3_shel.F90:1903
40 0x0000000000407ea2 main()  ???:0
41 0x0000000000022555 __libc_start_main()  ???:0
42 0x0000000000407da9 _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ww3_shel           000000000152421A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B464B1C4630  Unknown               Unknown  Unknown
libc-2.17.so       00002B464B526C4B  Unknown               Unknown  Unknown
libucp.so.0.0.0    00002B479823A81B  ucp_tag_recv_nb       Unknown  Unknown
libmlx-fi.so       00002B4797FCCDBB  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B4649B7C320  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B46498B933B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B464990A7A8  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B46498E97EC  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B4649A2C4DF  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002B46498BA405  PMPI_Alltoallv        Unknown  Unknown
ww3_shel           00000000014C8524  Unknown               Unknown  Unknown
ww3_shel           00000000014BA324  Unknown               Unknown  Unknown
ww3_shel           00000000014AFBCB  Unknown               Unknown  Unknown
ww3_shel           00000000014B1E77  Unknown               Unknown  Unknown
ww3_shel           00000000014B2089  Unknown               Unknown  Unknown
ww3_shel           00000000014A880A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA63A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA98D  Unknown               Unknown  Unknown
ww3_shel           00000000014A95AC  Unknown               Unknown  Unknown
ww3_shel           00000000014A8233  Unknown               Unknown  Unknown
ww3_shel           00000000014A467E  Unknown               Unknown  Unknown
ww3_shel           00000000014A3852  Unknown               Unknown  Unknown
ww3_shel           00000000014A39DB  Unknown               Unknown  Unknown
ww3_shel           00000000014A2C40  Unknown               Unknown  Unknown
ww3_shel           00000000014A2A93  Unknown               Unknown  Unknown
ww3_shel           0000000001291743  yowpdlibmain_mp_r         632  yowpdlibmain.F90
ww3_shel           000000000127F541  yowpdlibmain_mp_i         127  yowpdlibmain.F90
ww3_shel           00000000010DBE3D  pdlib_w3profsmd_m         265  w3profsmd_pdlib.F90
ww3_shel           00000000008A02E6  w3initmd_mp_w3ini         750  w3initmd.F90
ww3_shel           0000000000447F36  MAIN__                   1903  ww3_shel.F90
ww3_shel           0000000000407EA2  Unknown               Unknown  Unknown
libc-2.17.so       00002B464B3F3555  __libc_start_main     Unknown  Unknown
ww3_shel           0000000000407DA9  Unknown               Unknown  Unknown
srun: error: h36m52: task 2799: Exited with exit code 174

Fourth run with all 3:

export CFLAGS="-DSCOTCH_NOAA_DEBUG_1 -DSCOTCH_NOAA_DEBUG_2 -DSCOTCH_NOAA_DEBUG_3"
export CPPFLAGS="-DSCOTCH_NOAA_DEBUG_1 -DSCOTCH_NOAA_DEBUG_2 -DSCOTCH_NOAA_DEBUG_3"
export CXXFLAGS="-DSCOTCH_NOAA_DEBUG_1 -DSCOTCH_NOAA_DEBUG_2 -DSCOTCH_NOAA_DEBUG_3"

Error output:

       Wave model ...
[h36m52:271536:0:271536] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2a51540)
==== backtrace (tid: 271536) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000000155c4b __memcpy_ssse3_back()  :0
 2 0x000000000003581b ucp_tag_recv_nb()  ???:0
 3 0x000000000000bdbb mlx_tagged_recv()  mlx_tagged.c:0
 4 0x0000000000404320 fi_trecv()  /p/pdsd/scratch/Uploads/IMPI/other/software/libfabric/linux/v1.9.0/include/rdma/fi_tagged.h:91
 5 0x0000000000404320 MPIDI_OFI_do_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_recv.h:127
 6 0x0000000000404320 MPIDI_NM_mpi_irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/ofi_recv.h:377
 7 0x0000000000404320 MPIDI_irecv_handoff()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:81
 8 0x0000000000404320 MPIDI_irecv_unsafe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:238
 9 0x0000000000404320 MPIDI_irecv_safe()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:558
10 0x0000000000404320 MPID_Irecv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/ch4_recv.h:791
11 0x0000000000404320 MPIC_Irecv()  /build/impi/_buildspace/release/../../src/mpi/coll/helper_fns.c:625
12 0x000000000014133b MPIR_Alltoallv_intra_scattered_impl()  /build/impi/_buildspace/release/../../src/mpi/coll/intel/alltoallv/alltoallv_intra_scattered.c:186
13 0x00000000001927a8 MPIDI_NM_mpi_alltoallv()  /build/impi/_buildspace/release/../../src/mpid/ch4/netmod/include/../ofi/intel/ofi_coll.h:643
14 0x00000000001927a8 MPIDI_Alltoallv_intra_composition_alpha()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_impl.h:1794
15 0x00000000001927a8 MPID_Alltoallv_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:2276
16 0x00000000001927a8 MPIDI_coll_invoke()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_select_utils.c:3335
17 0x00000000001717ec MPIDI_coll_select()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll_globals_default.c:130
18 0x00000000002b44df MPID_Alltoallv()  /build/impi/_buildspace/release/../../src/mpid/ch4/src/intel/ch4_coll.h:240
19 0x0000000000142405 PMPI_Alltoallv()  /build/impi/_buildspace/release/../../src/mpi/coll/alltoallv/alltoallv.c:351
20 0x00000000014c852c _SCOTCHdgraphMatchSyncColl()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/dgraph_match_sync_coll.c:210
21 0x00000000014ba32e _SCOTCHdgraphCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/dgraph_coarsen.c:1377
22 0x00000000014afbcb bdgraphBipartMlCoarsen()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:114
23 0x00000000014b1e77 bdgraphBipartMl2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:775
24 0x00000000014b2089 _SCOTCHbdgraphBipartMl()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_ml.c:826
25 0x00000000014a880a _SCOTCHbdgraphBipartSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/bdgraph_bipart_st.c:377
26 0x00000000014aa63a kdgraphMapRbPart2()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:387
27 0x00000000014aa98d _SCOTCHkdgraphMapRbPart()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb_part.c:437
28 0x00000000014a95ac _SCOTCHkdgraphMapRb()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_rb.c:261
29 0x00000000014a8233 _SCOTCHkdgraphMapSt()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/kdgraph_map_st.c:186
30 0x00000000014a467e SCOTCH_dgraphMapCompute()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotch/library_dgraph_map.c:191
31 0x00000000014a3852 SCOTCH_ParMETIS_V3_PartKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:182
32 0x00000000014a39db SCOTCH_ParMETIS_V3_PartGeomKway()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part.c:230
33 0x00000000014a2c40 SCOTCH_PARMETIS_V3_PARTGEOMKWAY()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:118
34 0x00000000014a2a93 scotch_parmetis_v3_partgeomkway_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/scotch-7.0.3noaa2/src/libscotchmetis/parmetis_dgraph_part_f.c:96
35 0x0000000001291743 yowpdlibmain_mp_runparmetis_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/WW3/model/src/PDLIB/yowpdlibmain.F90:632
36 0x000000000127f541 yowpdlibmain_mp_initfromgriddim_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/WW3/model/src/PDLIB/yowpdlibmain.F90:127
37 0x00000000010dbe3d pdlib_w3profsmd_mp_pdlib_init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/WW3/model/src/w3profsmd_pdlib.F90:265
38 0x00000000008a02e6 w3initmd_mp_w3init_()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/WW3/model/src/w3initmd.F90:750
39 0x0000000000447f36 MAIN__()  /scratch1/NCEPDEV/climate/Jessica.Meixner/scotchtest/test20_123/WW3/model/src/ww3_shel.F90:1903
40 0x0000000000407ea2 main()  ???:0
41 0x0000000000022555 __libc_start_main()  ???:0
42 0x0000000000407da9 _start()  ???:0
=================================
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ww3_shel           000000000152428A  Unknown               Unknown  Unknown
libpthread-2.17.s  00002ACC9037B630  Unknown               Unknown  Unknown
libc-2.17.so       00002ACC906DDC4B  Unknown               Unknown  Unknown
libucp.so.0.0.0    00002ACDDD3F181B  ucp_tag_recv_nb       Unknown  Unknown
libmlx-fi.so       00002ACDDD183DBB  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACC8ED33320  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACC8EA7033B  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACC8EAC17A8  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACC8EAA07EC  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACC8EBE34DF  Unknown               Unknown  Unknown
libmpi.so.12.0.0   00002ACC8EA71405  PMPI_Alltoallv        Unknown  Unknown
ww3_shel           00000000014C852C  Unknown               Unknown  Unknown
ww3_shel           00000000014BA32E  Unknown               Unknown  Unknown
ww3_shel           00000000014AFBCB  Unknown               Unknown  Unknown
ww3_shel           00000000014B1E77  Unknown               Unknown  Unknown
ww3_shel           00000000014B2089  Unknown               Unknown  Unknown
ww3_shel           00000000014A880A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA63A  Unknown               Unknown  Unknown
ww3_shel           00000000014AA98D  Unknown               Unknown  Unknown
ww3_shel           00000000014A95AC  Unknown               Unknown  Unknown
ww3_shel           00000000014A8233  Unknown               Unknown  Unknown
ww3_shel           00000000014A467E  Unknown               Unknown  Unknown
ww3_shel           00000000014A3852  Unknown               Unknown  Unknown
ww3_shel           00000000014A39DB  Unknown               Unknown  Unknown
ww3_shel           00000000014A2C40  Unknown               Unknown  Unknown
ww3_shel           00000000014A2A93  Unknown               Unknown  Unknown
ww3_shel           0000000001291743  yowpdlibmain_mp_r         632  yowpdlibmain.F90
ww3_shel           000000000127F541  yowpdlibmain_mp_i         127  yowpdlibmain.F90
ww3_shel           00000000010DBE3D  pdlib_w3profsmd_m         265  w3profsmd_pdlib.F90
ww3_shel           00000000008A02E6  w3initmd_mp_w3ini         750  w3initmd.F90
ww3_shel           0000000000447F36  MAIN__                   1903  ww3_shel.F90
ww3_shel           0000000000407EA2  Unknown               Unknown  Unknown
libc-2.17.so       00002ACC905AA555  __libc_start_main     Unknown  Unknown
ww3_shel           0000000000407DA9  Unknown               Unknown  Unknown
srun: error: h36m52: task 2799: Exited with exit code 174

@MatthewMasarik-NOAA
Collaborator Author

Hi all, this is a repost from the email thread in case it was missed. It is a traceback from when I started testing with GNU (i.e., no Intel), so it is a data point to compare with @JessicaMeixner-NOAA's Intel results just posted. It also shows the assessment that the SCOTCH_ParMETIS_V3_PartGeomKway() routine is where things go awry, the work I had started to inspect within that SCOTCH routine, and the checks for consistency of the input args passed from WW3 to that routine.


The SCOTCH routine we use is SCOTCH_ParMETIS_V3_PartGeomKway(). The program is crashing during this subroutine.

That routine calls SCOTCH_ParMETIS_V3_PartKway(), which calls SCOTCH_dgraphBuild() and then SCOTCH_dgraphCheck() (the check is guarded by an #ifdef SCOTCH_DEBUG_ALL; I commented that guard out while testing to be sure program execution goes through it). I added some print statements around these two calls, as shown
scotchparmetis

It appears to be dying in SCOTCH_dgraphBuild(). A couple of the fastest processes make it to 3. and 4. before the whole thing chokes, but it does seem to be having an issue somewhere in SCOTCH_dgraphBuild(), and is not making it to SCOTCH_dgraphCheck().
Here's the traceback
traceback

From the traceback, the first intelligible record is for _SCOTCHdgraphCoarsen.

Since the problem seems to be in SCOTCH_ParMETIS_V3_PartGeomKway(), I've started to check the sizes/types of the inputs to this routine. Some related variables are hard-coded ( REAL(4) ) or double precision, with some manual conversion between the two. To check these I've written the values out and inspected them manually. If closer inspection of the input args is needed, please let me know and I will follow this route further.

@MatthewMasarik-NOAA
Collaborator Author

For current efforts, I am working to get the GNU make of SCOTCH + WW3 working, as was requested. I have been able to get SCOTCH to compile this way so far, but I still need to troubleshoot the WW3 build against the GNU-make output of SCOTCH. In summary, I believe the SCOTCH portions of our WW3 cmake build were developed based on the output of the cmake build of SCOTCH. The GNU make output differs in some way that WW3's cmake has a hard time consuming. I don't anticipate this being too difficult to solve though, so will hopefully have more to report soon.

@aronroland
Collaborator

Hi @MatthewMasarik-NOAA, sure, let's set up a meeting with @JessicaMeixner-NOAA and check where we are with respect to the work schedule; maybe @thesser1 can join us and we can discuss the actual state of the work.

@JessicaMeixner-NOAA
Collaborator

@aronroland I'm on leave Weds-Friday, so I'll set up a time for today. Should we invite Francois too, as we have some additional output that perhaps he can provide feedback on?

@MatthewMasarik-NOAA
Collaborator Author

Hi @MatthewMasarik-NOAA, sure, let's set up a meeting with @JessicaMeixner-NOAA and check where we are with respect to the work schedule; maybe @thesser1 can join us and we can discuss the actual state of the work.

Sure thing, @aronroland. I'm looking forward to discussing today.

@thesser1
Collaborator

thesser1 commented Apr 4, 2023 via email

@MatthewMasarik-NOAA
Collaborator Author

SCOTCH_NOAA_DEBUG + SCOTCH GNU make

Results for running the three NOAA debug flags separately in the instrumented 'noaa2' SCOTCH repo. SCOTCH is built using the traditional/GNU make system with compiler/MPI, intel/impi 2022.1.2. The Makefile.inc.x86-64_pc_linux2.icc.impi.debug is used which has compile options

CFLAGS		= -g -O0 -fp-model strict -traceback -fp-stack-check -DCOMMON_FILE_COMPRESS_GZ -DCOMMON_PTHREAD -DCOMMON_PTHREAD_AFFINITY_LINUX -DCOMMON_RANDOM_FIXED_SEED -DSCOTCH_DEBUG_ALL -DSCOTCH_DETERMINISTIC -DSCOTCH_MPI_ASYNC_COLL -DSCOTCH_PTHREAD -DSCOTCH_PTHREAD_MPI -DSCOTCH_RENAME -restrict -DIDXSIZE64

along with each -DSCOTCH_NOAA_DEBUG_[1,2,3] flag appended to CFLAGS as verified in the noaa-[1,2,3]-scotch-make.out.txt files in each section below.

SCOTCH_NOAA_DEBUG_1

noaa-1-scotch-make.out.txt
noaa-1-slurm-43586053.out.truncated.txt

  • Traceback (see noaa-1-slurm-43586053.out.truncated.txt for more output)
       Type 4 : Restart files
      -----------------------------------------
            From     : 2020/08/15 00:00:00 UTC
            To       : 2020/08/20 00:00:00 UTC
            Interval :          1 00:00:00

            output dates out of run dates : Track point output deactivated
            output dates out of run dates : Nesting data deactivated
            output dates out of run dates : Partitioned wave field data deactivated
            output dates out of run dates : Restart files second request deactivated
       Wave model ...
(736): ERROR: dgraphCoarsen: invalid matching
(2599): ERROR: _SCOTCHdgraphMatchLc: undersized multinode array (4)
(2599): ERROR: dgraphMatchCheck: unmatched local vertex
(2599): ERROR: dgraphCoarsen: invalid matching
(162): ERROR: dgraphCoarsen: invalid matching
(2089): ERROR: dgraphCoarsen: invalid matching
(1522): ERROR: dgraphCoarsen: invalid matching
(1857): ERROR: dgraphCoarsen: invalid matching
(1465): ERROR: dgraphCoarsen: invalid matching
(1058): ERROR: dgraphCoarsen: invalid matching

SCOTCH_NOAA_DEBUG_2

noaa-2-scotch-make.out.txt
noaa-2-slurm-43586629.out.truncated.txt

  • Traceback (see noaa-2-slurm-43586629.out.truncated.txt for more output)
      Type 4 : Restart files
      -----------------------------------------
            From     : 2020/08/15 00:00:00 UTC
            To       : 2020/08/20 00:00:00 UTC
            Interval :          1 00:00:00

            output dates out of run dates : Track point output deactivated
            output dates out of run dates : Nesting data deactivated
            output dates out of run dates : Partitioned wave field data deactivated
            output dates out of run dates : Restart files second request deactivated
       Wave model ...
(2599): ERROR: _SCOTCHdgraphMatchLc: undersized multinode array (4)
(2599): ERROR: dgraphMatchCheck: unmatched local vertex
(1722): ERROR: dgraphCoarsen: invalid matching
(2520): ERROR: dgraphCoarsen: invalid matching
(1880): ERROR: dgraphCoarsen: invalid matching
(840): ERROR: dgraphCoarsen: invalid matching
(1961): ERROR: dgraphCoarsen: invalid matching
(2204): ERROR: dgraphCoarsen: invalid matching

SCOTCH_NOAA_DEBUG_3

noaa-3-scotch-make.out.txt
noaa-3-slurm-43586236.out.truncated.txt

  • Traceback (see noaa-3-slurm-43586236.out.truncated.txt for more output)
       Type 4 : Restart files
      -----------------------------------------
            From     : 2020/08/15 00:00:00 UTC
            To       : 2020/08/20 00:00:00 UTC
            Interval :          1 00:00:00

            output dates out of run dates : Track point output deactivated
            output dates out of run dates : Nesting data deactivated
            output dates out of run dates : Partitioned wave field data deactivated
            output dates out of run dates : Restart files second request deactivated
       Wave model ...
(2599): ERROR: _SCOTCHdgraphMatchLc: undersized multinode array (4)
(2599): ERROR: dgraphMatchCheck: unmatched local vertex
(25): ERROR: dgraphCoarsen: invalid matching
(1641): ERROR: dgraphCoarsen: invalid matching
(2542): ERROR: dgraphCoarsen: invalid matching
(1925): ERROR: dgraphCoarsen: invalid matching
(1450): ERROR: dgraphCoarsen: invalid matching
(1253): ERROR: dgraphCoarsen: invalid matching
(124): ERROR: dgraphCoarsen: invalid matching
(1819): ERROR: dgraphCoarsen: invalid matching
(600): ERROR: dgraphCoarsen: invalid matching
(200): ERROR: dgraphCoarsen: invalid matching

@JessicaMeixner-NOAA
Collaborator

I ran the test case built with intel/2022.1.2 and mvapich2/2.3, both last week and again today; the job just hangs and I get no output. I believe this is consistent with what @thesser1 reported as well.

@MatthewMasarik-NOAA
Collaborator Author

Trying to understand why each of the three tests (-DSCOTCH_NOAA_DEBUG_1,2,3) gave the same output, I re-ran them, this time removing the -DSCOTCH_DEBUG_ALL flag. I also removed -DSCOTCH_PTHREAD and -DSCOTCH_PTHREAD_MPI, since we have not been using those. This gave essentially the same output each time, namely the realloc error that has been shared before

       Type 4 : Restart files                                                                                                      
      -----------------------------------------                                                                                    
            From     : 2020/08/15 00:00:00 UTC                                                                                     
            To       : 2020/08/20 00:00:00 UTC                                                                                     
            Interval :          1 00:00:00                                                                                         
                                                                                                                                   
            output dates out of run dates : Track point output deactivated                                                         
            output dates out of run dates : Nesting data deactivated                                                               
            output dates out of run dates : Partitioned wave field data deactivated                                                
            output dates out of run dates : Restart files second request deactivated                                               
       Wave model ...                                                                                                              
*** Error in `/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel': realloc(): invalid next size: 0x000000000420eca0 ***
======= Backtrace: =========                                                                                                       
/lib64/libc.so.6(+0x7f474)[0x2b1757cfa474]                                                                                         
/lib64/libc.so.6(+0x84861)[0x2b1757cff861]                                                                                         
/lib64/libc.so.6(realloc+0x1d2)[0x2b1757d00e12]                                                                                    
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14c211a]                                
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14b4a5d]                                
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14b7449]

This is with flags

CFLAGS=-g -O0 -fp-model strict -traceback -fp-stack-check -DCOMMON_FILE_COMPRESS_GZ -DCOMMON_PTHREAD -DCOMMON_PTHREAD_AFFINITY_LINUX -DCOMMON_RANDOM_FIXED_SEED  -DSCOTCH_DETERMINISTIC -DSCOTCH_MPI_ASYNC_COLL -DSCOTCH_RENAME -restrict -DIDXSIZE64

@MatthewMasarik-NOAA
Collaborator Author

Update - Aron's debug flags

Reporting output for runs that use:

  • SCOTCH built by traditional make using the suggested Makefile.inc for intel/debug
  • WW3 built with Aron's Fortran debug flags

For these builds, the SCOTCH_NOAA_DEBUG_1,2,3 tests were each run separately. They produced similar output in each case, so I'll display the traceback for SCOTCH_NOAA_DEBUG_1 and post the related logs below.

       Type 4 : Restart files                                                                                                                                            
      -----------------------------------------                                                                                                                          
            From     : 2020/08/15 00:00:00 UTC                                                                                                                           
            To       : 2020/08/20 00:00:00 UTC                                                                                                                           
            Interval :          1 00:00:00                                                                                                                               
                                                                                                                                                                         
            output dates out of run dates : Track point output deactivated                                                                                               
            output dates out of run dates : Nesting data deactivated                                                                                                     
            output dates out of run dates : Partitioned wave field data deactivated                                                                                      
            output dates out of run dates : Restart files second request deactivated                                                                                     
       Wave model ...                                                                                                                                                    
*** Error in `/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel': realloc(): invalid next size: 0x0000000003c85230 ***            
======= Backtrace: =========                                                                                                                                             
/lib64/libc.so.6(+0x7f474)[0x2b537a529474]                                                                                                                               
/lib64/libc.so.6(+0x84861)[0x2b537a52e861]                                                                                                                               
/lib64/libc.so.6(realloc+0x1d2)[0x2b537a52fe12]                                                                                                                          
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14d4620]                                                                      
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14c6f6d]                                                                      
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14c9959]                                                                      
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14c9cdb]                                                                      
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14bdb77]                                                                      
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14c0745]                                                                      
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14c0c96]                                                                      
/scratch1/NCEPDEV/climate/Matthew.Masarik/projs/unstruct/scaling/runs/bin-noaa2/ww3_shel[0x14bf02c]
        .
 -- truncated --
        .
        .
7ffdf9bfd000-7ffdf9c3a000 rw-p 00000000 00:00 0                          [stack]                                                                                         
7ffdf9ce4000-7ffdf9ce6000 r-xp 00000000 00:00 0                          [vdso]                                                                                          
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]                                                                                      
forrtl: error (76): Abort trap signal                                                                                                                                    
Image              PC                Routine            Line        Source                                                                                               
ww3_shel           0000000001554A1B  Unknown               Unknown  Unknown                                                                                              
libpthread-2.17.s  00002B537A29D630  Unknown               Unknown  Unknown                                                                                              
libc-2.17.so       00002B537A4E0387  gsignal               Unknown  Unknown                                                                                              
libc-2.17.so       00002B537A4E1A78  abort                 Unknown  Unknown                                                                                              
libc-2.17.so       00002B537A522F67  Unknown               Unknown  Unknown                                                                                              
libc-2.17.so       00002B537A529474  Unknown               Unknown  Unknown                                                                                              
libc-2.17.so       00002B537A52E861  Unknown               Unknown  Unknown                                                                                              
libc-2.17.so       00002B537A52FE12  realloc               Unknown  Unknown                                                                                              
ww3_shel           00000000014D4620  _SCOTCHdgraphCoar        1436  dgraph_coarsen.c                                                                                     
ww3_shel           00000000014C6F6D  bdgraphBipartMlCo         114  bdgraph_bipart_ml.c                                                                                  
ww3_shel           00000000014C9959  bdgraphBipartMl2          775  bdgraph_bipart_ml.c                                                                                  
ww3_shel           00000000014C9CDB  _SCOTCHbdgraphBip         826  bdgraph_bipart_ml.c                                                                                  
ww3_shel           00000000014BDB77  _SCOTCHbdgraphBip         377  bdgraph_bipart_st.c                                                                                  
ww3_shel           00000000014C0745  kdgraphMapRbPart2         387  kdgraph_map_rb_part.c                                                                                
ww3_shel           00000000014C0C96  _SCOTCHkdgraphMap         437  kdgraph_map_rb_part.c                                                                                
ww3_shel           00000000014BF02C  _SCOTCHkdgraphMap         261  kdgraph_map_rb.c                                                                                     
ww3_shel           00000000014BD13E  _SCOTCHkdgraphMap         186  kdgraph_map_st.c                                                                                     
ww3_shel           00000000014B7A0F  SCOTCH_dgraphMapC         191  library_dgraph_map.c                                                                                 
ww3_shel           00000000014B66D4  SCOTCH_ParMETIS_V         182  parmetis_dgraph_part.c                                                                               
ww3_shel           00000000014B6971  SCOTCH_ParMETIS_V         230  parmetis_dgraph_part.c                                                                               
ww3_shel           00000000014B5612  SCOTCH_PARMETIS_V         118  parmetis_dgraph_part_f.c                                                                             
ww3_shel           00000000014B5409  scotch_parmetis_v          96  parmetis_dgraph_part_f.c                                                                             
ww3_shel           00000000012A1598  yowpdlibmain_mp_r         632  yowpdlibmain.F90                                                                                     
ww3_shel           000000000128F15A  yowpdlibmain_mp_i         127  yowpdlibmain.F90                                                                                     
ww3_shel           00000000010E886A  pdlib_w3profsmd_m         265  w3profsmd_pdlib.F90                                                                                  
ww3_shel           00000000008A2BF0  w3initmd_mp_w3ini         750  w3initmd.F90                                                                                         
ww3_shel           000000000044910C  MAIN__                   1903  ww3_shel.F90                                                                                         
ww3_shel           0000000000408562  Unknown               Unknown  Unknown                                                                                              
libc-2.17.so       00002B537A4CC555  __libc_start_main     Unknown  Unknown                                                                                              
ww3_shel           0000000000408469  Unknown               Unknown  Unknown                                                                                              
srun: error: h34m52: task 2599: Aborted (core dumped)                                                                                                                    
srun: launch/slurm: _step_signal: Terminating StepId=43728715.0                                                                                                          
slurmstepd: error: *** STEP 43728715.0 ON h1c17 CANCELLED AT 2023-04-11T19:11:02 ***                                                                                     
forrtl: error (78): process killed (SIGTERM) 

SCOTCH_NOAA_DEBUG_1

debug1.scotch.make.out.txt
debug1.ww3.make.out.txt
debug1.slurm-43728715.out.truncated.txt

SCOTCH_NOAA_DEBUG_2

debug2.scotch.make.out.txt
debug2.ww3.make.out.txt
debug2.slurm-43728258.out.truncated.txt

SCOTCH_NOAA_DEBUG_3

debug3.scotch.make.out.txt
debug3.ww3.make.out.txt
debug3.slurm-43729063.out.truncated.txt

The same runs were done with fprintf statements, which confirmed that each of the added #ifdef SCOTCH_NOAA_DEBUG_* blocks was entered. Those files are not included because they give the same information as above but get really heavy with all the write statements to stderr. When running with 2600 MPI tasks, the counts of write statements on entry to each block are:

  • SCOTCH_NOAA_DEBUG_1 (A): 1635560
  • SCOTCH_NOAA_DEBUG_1 (B): 1635560
  • SCOTCH_NOAA_DEBUG_2: 13000
  • SCOTCH_NOAA_DEBUG_3: 13000

Q: I've been compiling SCOTCH (both cmake and now traditional make) without SCOTCH_PTHREAD or SCOTCH_PTHREAD_MPI set, then running with 2600 MPI tasks. Do these numbers seem right -- should we be seeing counts this high?

Putting it together

From these tracebacks, the crash is clearly in dgraph_coarsen.c (line 1436), with a realloc error.

The tracebacks with SCOTCH_DEBUG_ALL set (from the comment above) also show dgraphCoarsen as the point of failure

       Wave model ...
(736): ERROR: dgraphCoarsen: invalid matching
(2599): ERROR: _SCOTCHdgraphMatchLc: undersized multinode array (4)
(2599): ERROR: dgraphMatchCheck: unmatched local vertex
(2599): ERROR: dgraphCoarsen: invalid matching

with an error message mentioning undersized multinode array (4) in the matching step.

Line 1436 in dgraph_coarsen.c has a call to `memRealloc(...)` to resize a multinode array.

1434: if (matedat.c.multloctmp != NULL) {             /* If we allocated the multinode array */                                                                              
1435:   matedat.c.multloctmp =                                                                                                                                               
1436:   matedat.c.multloctab = memRealloc (matedat.c.multloctab, matedat.c.multlocnbr * sizeof (DgraphCoarsenMulti)); /* Resize multinode array */                           
1437: }       

Is line 1435 intended to be the way it is? It may be: lines 1435-1436 form a single statement, a chained assignment that stores the return value of memRealloc in both multloctab and multloctmp.

@MatthewMasarik-NOAA
Collaborator Author

This issue was resolved within SCOTCH by release 7.0.4. A scotch/7.0.4 module was added to spack-stack/1.5.0 and installed and tested on the RDHPCS machines; scotch/7.0.4 has also been installed on WCOSS2. A final check to confirm scalability was done by @JessicaMeixner-NOAA on cactus (~22 Dec 2023) by running WW3 coupled with 6000 PETs for the wave component.
