
Add machine config file for RZAnsel #377

Merged
merged 41 commits into develop from JoshuaSBrown/setup-rzansel-machine-config on Feb 3, 2021

Conversation

@JoshuaSBrown (Collaborator) commented Nov 23, 2020

PR Summary

This adds a machine config file for RZAnsel, which should make it easier for users to get up and running with a working build. Following the approach @pgrete used in the Summit config file, three variants can be built: "cuda", "mpi", and "cuda+mpi".

Note that I have not removed the old documentation for building on RZAnsel, because it was erroneously altered from the Summit documentation. Per my conversation with @pgrete, he was going to fix it, so I did not want to touch it.

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md

@JoshuaSBrown (Collaborator Author)

Waiting for this https://re-git.lanl.gov/eap-oss/parthenon-project/-/merge_requests/2 to be merged first.

@JoshuaSBrown (Collaborator Author) commented Jan 4, 2021

This is now ready for review with the exception of these lines:

#set(RZANSEL_PROJECT_PREFIX /usr/gapps/parthenon_shared/parthenon-project
#    CACHE STRING "Path to parthenon-project checkout")

set(RZANSEL_PROJECT_PREFIX /g/g15/brown338/Software/parthenon-project
    CACHE STRING "Path to parthenon-project checkout")

I will fix this once the parthenon-project repo is updated and the installation on RZAnsel is updated as well.

@JoshuaSBrown JoshuaSBrown changed the title WIP Add machine config file for RZAnsel Add machine config file for RZAnsel Jan 4, 2021
@JoshuaSBrown (Collaborator Author)

Ok, this is good to go.

Joshua S Brown and others added 2 commits January 19, 2021 16:18
Co-authored-by: Philipp Grete <gretephi@msu.edu>
Co-authored-by: Philipp Grete <gretephi@msu.edu>
@JoshuaSBrown (Collaborator Author) commented Jan 20, 2021

@pgrete I could use some help with this: either there is some logic in parthenon that needs to be corrected, or the tests simply are not configured to run with 4 GPUs and 4 MPI ranks:

This first scenario should be the one that works:

I can verify that nvidia-smi indicates 4 tasks are launched.

    Start 34: regression_mpi_test:advection_performance
1/6 Test #34: regression_mpi_test:advection_performance ...***Failed   68.00 sec


test_dir=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance']
output_dir='/g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_performance_mpi'
driver=['/g/g15/brown338/Software/parthenon/build/example/advection/advection-example']
driver_input=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance']
kokkos_args=['--kokkos-num-devices=1 --kokkos-threads=1']
num_steps=5
mpirun=['/usr/tcetmp/bin/jsrun']
mpirun_opts=['-a', '4', "-c 1 -n 1 -g 1 -r 1 -d packed --smpiargs='-gpu'"]
coverage=False
*****************************************************************
Beginning Python regression testing script
*****************************************************************

Initializing Test Case
Using:
driver at:       /g/g15/brown338/Software/parthenon/build/example/advection/advection-example
driver input at: /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance
test folder:     /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance
output sent to:  /g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_performance_mpi

Make output folder in test if does not exist
*****************************************************************
Preparing Test Case Step 1
*****************************************************************

*****************************************************************
Running Driver
*****************************************************************

Command to execute driver
/usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 1 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338/Software/parthenon/build/example/advection/advection-example -i /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance parthenon/mesh/nx1=256 parthenon/meshblock/nx1=256 parthenon/mesh/nx2=256 parthenon/meshblock/nx2=256 parthenon/mesh/nx3=256 parthenon/meshblock/nx3=256 --kokkos-num-devices=1 --kokkos-threads=1

*****************************************************************
Subprocess error message
*****************************************************************

b'### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
'

    Start 34: regression_mpi_test:advection_performance
1/6 Test #34: regression_mpi_test:advection_performance ...***Failed   66.39 sec


test_dir=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance']
output_dir='/g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_performance_mpi'
driver=['/g/g15/brown338/Software/parthenon/build/example/advection/advection-example']
driver_input=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance']
kokkos_args=['--kokkos-num-devices=4 --kokkos-threads=1']
num_steps=5
mpirun=['/usr/tcetmp/bin/jsrun']
mpirun_opts=['-a', '4', "-c 1 -n 1 -g 1 -r 1 -d packed --smpiargs='-gpu'"]
coverage=False
*****************************************************************
Beginning Python regression testing script
*****************************************************************

Initializing Test Case
Using:
driver at:       /g/g15/brown338/Software/parthenon/build/example/advection/advection-example
driver input at: /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance
test folder:     /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance
output sent to:  /g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_performance_mpi

Make output folder in test if does not exist
*****************************************************************
Preparing Test Case Step 1
*****************************************************************

*****************************************************************
Running Driver
*****************************************************************

Command to execute driver
/usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 1 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338/Software/parthenon/build/example/advection/advection-example -i /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance parthenon/mesh/nx1=256 parthenon/meshblock/nx1=256 parthenon/mesh/nx2=256 parthenon/meshblock/nx2=256 parthenon/mesh/nx3=256 parthenon/meshblock/nx3=256 --kokkos-num-devices=4 --kokkos-threads=1

*****************************************************************
Subprocess error message
*****************************************************************

b''

*****************************************************************
Error detected while running subprocess command
*****************************************************************

Traceback (most recent call last):
  File "/g/g15/brown338/Software/parthenon/tst/regression/utils/test_case.py", line 237, in Run
    proc = subprocess.run(run_command, check=True, stdout=PIPE, stderr=PIPE)
  File "/usr/gapps/parthenon_shared/parthenon-project/views/rzansel/ppc64le/gcc8/2021-01-04/lib/python3.8/subprocess.py", line 512, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/tcetmp/bin/jsrun', '-a', '4', '-c', '1', '-n', '1', '-g', '1', '-r', '1', '-d', 'packed', "--smpiargs='-gpu'", '/g/g15/brown338/Software/parthenon/build/example/advection/advection-example', '-i', '/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance', 'parthenon/mesh/nx1=256', 'parthenon/meshblock/nx1=256', 'parthenon/mesh/nx2=256', 'parthenon/meshblock/nx2=256', 'parthenon/mesh/nx3=256', 'parthenon/meshblock/nx3=256', '--kokkos-num-devices=4', '--kokkos-threads=1']' returned non-zero exit status 134.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/g/g15/brown338/Software/parthenon/tst/regression/run_test.py", line 152, in <module>
    main(**vars(args))
  File "/g/g15/brown338/Software/parthenon/tst/regression/run_test.py", line 76, in main
    test_manager.Run()
  File "/g/g15/brown338/Software/parthenon/tst/regression/utils/test_case.py", line 247, in Run
    raise TestManagerError('\nReturn code {0} from command \'{1}\''
utils.test_case.TestManagerError: 
Return code 134 from command '/usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 1 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338/Software/parthenon/build/example/advection/advection-example -i /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance parthenon/mesh/nx1=256 parthenon/meshblock/nx1=256 parthenon/mesh/nx2=256 parthenon/meshblock/nx2=256 parthenon/mesh/nx3=256 parthenon/meshblock/nx3=256 --kokkos-num-devices=4 --kokkos-threads=1'
    Start 34: regression_mpi_test:advection_performance
1/6 Test #34: regression_mpi_test:advection_performance ...***Failed   68.12 sec


test_dir=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance']
output_dir='/g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_performance_mpi'
driver=['/g/g15/brown338/Software/parthenon/build/example/advection/advection-example']
driver_input=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance']
kokkos_args=['--kokkos-num-devices=1 --kokkos-threads=1']
num_steps=5
mpirun=['/usr/tcetmp/bin/jsrun']
mpirun_opts=['-a', '4', "-c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu'"]
coverage=False
*****************************************************************
Beginning Python regression testing script
*****************************************************************

Initializing Test Case
Using:
driver at:       /g/g15/brown338/Software/parthenon/build/example/advection/advection-example
driver input at: /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance
test folder:     /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance
output sent to:  /g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_performance_mpi

Make output folder in test if does not exist
*****************************************************************
Preparing Test Case Step 1
*****************************************************************

*****************************************************************
Running Driver
*****************************************************************

Command to execute driver
/usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338/Software/parthenon/build/example/advection/advection-example -i /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_performance/parthinput.advection_performance parthenon/mesh/nx1=256 parthenon/meshblock/nx1=256 parthenon/mesh/nx2=256 parthenon/meshblock/nx2=256 parthenon/mesh/nx3=256 parthenon/meshblock/nx3=256 --kokkos-num-devices=1 --kokkos-threads=1

*****************************************************************
Subprocess error message
*****************************************************************

b'### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
### PARTHENON ERROR
  Message:     ### FATAL ERROR in Mesh constructor
Too few mesh blocks: nbtotal (1) < nranks (4)

  File:        ../src/mesh/mesh.cpp
  Line number: 450
'

@jonahm-LANL I could use your input here as well.

@pgrete (Collaborator) commented Jan 20, 2021

This simple advection_performance test is currently set up to measure the overhead of overdecomposition, i.e., for a fixed mesh size (here 256^3) the number of MeshBlocks into which the Mesh is split is successively increased (by decreasing the block size).
Thus, this test assumes that the number of compute elements (e.g., a GPU) remains constant and, at the same time, that there is only one compute element (as the performance baseline is a single MeshBlock covering the entire Mesh).
We'd need to set up several different tests to capture the various performance aspects (I'm specifically thinking of the proxy app here, given that it will allow us to increase the compute- versus infrastructure-related pieces of the code to more realistic levels).
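
For illustration only (this is a hypothetical sketch, not code from the test suite, and it assumes the block size is halved at each of the five steps of the fixed 256^3 mesh), the overdecomposition sweep described above looks roughly like this, which also shows why the first step cannot run with four ranks:

    # Hypothetical illustration of the overdecomposition sweep described above
    # (not the actual test logic; the real parameters live in the test suite).
    mesh = 256       # fixed mesh size per dimension
    num_ranks = 4    # ranks used in the failing jsrun invocation above

    for step in range(5):
        block = mesh // 2**step          # block size: 256, 128, 64, 32, 16
        nbtotal = (mesh // block) ** 3   # number of MeshBlocks: 1, 8, 64, 512, 4096
        status = "ok" if nbtotal >= num_ranks else "FATAL: nbtotal < nranks"
        print(f"block={block:3d}  nbtotal={nbtotal:5d}  {status}")

    # With 4 ranks, the very first step (a single MeshBlock covering the whole
    # Mesh) already violates nbtotal >= nranks, which is exactly the
    # "Too few mesh blocks: nbtotal (1) < nranks (4)" error in the logs above.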

@AndrewGaspar (Contributor)

@pgrete Yeah, that's all well and good - we should definitely ensure we're holding the number of GPUs constant when doing performance testing, which you can control using NUM_GPU_DEVICES_PER_NODE. I think all Josh is trying to ask is how to get jsrun to correctly partition the GPUs.

@pgrete (Collaborator) commented Jan 22, 2021

> @pgrete Yeah, that's all well and good - we should definitely ensure we're holding the number of GPUs constant when doing performance testing, which you can control using NUM_GPU_DEVICES_PER_NODE. I think all Josh is trying to ask is how to get jsrun to correctly partition the GPUs.

Sorry, my previous comment was probably not clear.
I was trying to say that this specific advection_performance test only makes sense when run on a single GPU, as the baseline in that test uses a single MeshBlock for the entire Mesh. Thus, independent of CPU or GPU, that test can only be run using a single rank (that is also where the error message comes from).
So this test should not be run with
/usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338
but rather with
/usr/tcetmp/bin/jsrun -a 1 -c 1 -n 1 -g 1 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338

It may be worth adding a safety check (i.e., ranks == 1) to that test, similar to the advection_convergence test (a possible version is sketched after the excerpt below):

        # make sure we can evenly distribute the MeshBlock sizes
        err_msg = "Num ranks must be multiples of 2 for convergence test." 
        assert parameters.num_ranks == 1 or parameters.num_ranks % 2 == 0, err_msg                                             
        # ensure a minimum block size of 4
        assert lin_res[0] / parameters.num_ranks >= 4, "Use <= 8 ranks for convergence test."
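
For concreteness, a minimal sketch of such a guard for advection_performance (assuming the same parameters.num_ranks attribute used in the excerpt above; the exact message and placement are illustrative):

    # Sketch only: restrict advection_performance to a single rank, mirroring
    # the style of the advection_convergence asserts above.
    err_msg = ("advection_performance measures overdecomposition overhead on a "
               "single compute element; run it with exactly 1 rank.")
    assert parameters.num_ranks == 1, err_msg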

@JoshuaSBrown JoshuaSBrown changed the title Add machine config file for RZAnsel [WIP] Add machine config file for RZAnsel Jan 27, 2021
@JoshuaSBrown JoshuaSBrown changed the title [WIP] Add machine config file for RZAnsel Add machine config file for RZAnsel Feb 1, 2021
@JoshuaSBrown (Collaborator Author)

Thanks, everyone, for the feedback. @pgrete, do you want to take a last look over it?

@pgrete (Collaborator) commented Feb 1, 2021

I think making sure that the advection_performance test in its current version is only run by a single rank is good!
With respect to the advection_convergence test, this change is not required (this was probably not clear from our discussion on Matrix). In fact, advection_convergence is designed to work with a single rank as well as with 2, 4, and 8 ranks. So I think we should not set that number to 1.

@JoshuaSBrown (Collaborator Author) commented Feb 1, 2021

> I think making sure that the advection_performance test in its current version is only run by a single rank is good!
> With respect to the advection_convergence test, this change is not required (this was probably not clear from our discussion on Matrix). In fact, advection_convergence is designed to work with a single rank as well as with 2, 4, and 8 ranks. So I think we should not set that number to 1.

Ok, well I'm thoroughly confused, because when I run the convergence tests with 4 ranks and 4 GPUs, it is still only making use of a single GPU. I'm also pretty sure I have the correct command, because the restart test will actually utilize all four GPUs with the same command.

@pgrete (Collaborator) commented Feb 1, 2021

> > I think making sure that the advection_performance test in its current version is only run by a single rank is good!
> > With respect to the advection_convergence test, this change is not required (this was probably not clear from our discussion on Matrix). In fact, advection_convergence is designed to work with a single rank as well as with 2, 4, and 8 ranks. So I think we should not set that number to 1.
>
> Ok, well I'm thoroughly confused, because when I run the convergence tests with 4 ranks and 4 GPUs, it is still only making use of a single GPU. I'm also pretty sure I have the correct command, because the restart test will actually utilize all four GPUs with the same command.

Let me double check in practice. At least at first sight I don't see a difference in the parameter/CMake files that could cause that behavior.

@JoshuaSBrown (Collaborator Author)


/usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338/Software/parthenon/build/example/advection/advection-example -i /g/g15/brown338/Software/parthenon/tst/regression/test_suites/restart/parthinput.restart parthenon/job/problem_id=gold --kokkos-num-devices=4 --kokkos-threads=1

Gives me

Mon Feb  1 13:10:48 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.95.01    Driver Version: 440.95.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   29C    P0    51W / 300W |    446MiB / 16160MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   29C    P0    50W / 300W |    446MiB / 16160MiB |      6%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   29C    P0    49W / 300W |    446MiB / 16160MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   30C    P0    48W / 300W |    446MiB / 16160MiB |      9%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    132066      C   ...ild/example/advection/advection-example   435MiB |
|    1    132067      C   ...ild/example/advection/advection-example   435MiB |
|    2    132068      C   ...ild/example/advection/advection-example   435MiB |
|    3    132069      C   ...ild/example/advection/advection-example   435MiB |
+-----------------------------------------------------------------------------+

But with advection_convergence:

41: Test command: /usr/gapps/parthenon_shared/parthenon-project/views/rzansel/ppc64le/gcc8/2021-01-04/bin/python3.8 "/g/g15/brown338/Software/parthenon/tst/regression/run_test.py" "--mpirun" "/usr/tcetmp/bin/jsrun" "--mpirun_opts=-a" "--mpirun_opts=4" "--mpirun_opts=-c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu'" "--driver" "/g/g15/brown338/Software/parthenon/build/example/advection/advection-example" "--driver_input" "/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence/parthinput.advection" "--num_steps" "25" "--test_dir" "/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence" "--output_dir" "/g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_convergence_mpi" "--kokkos_args=--kokkos-num-devices=4 --kokkos-threads=1"
41: Test timeout computed to be: 1500
41: 
41: 
41: test_dir=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence']
41: output_dir='/g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_convergence_mpi'
41: driver=['/g/g15/brown338/Software/parthenon/build/example/advection/advection-example']
41: driver_input=['/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence/parthinput.advection']
41: kokkos_args=['--kokkos-num-devices=4 --kokkos-threads=1']
41: num_steps=25
41: mpirun=['/usr/tcetmp/bin/jsrun']
41: mpirun_opts=['-a', '4', "-c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu'"]
41: coverage=False
41: *****************************************************************
41: Beginning Python regression testing script
41: *****************************************************************
41: 
41: Initializing Test Case
41: Using:
41: driver at:       /g/g15/brown338/Software/parthenon/build/example/advection/advection-example
41: driver input at: /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence/parthinput.advection
41: test folder:     /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence
41: output sent to:  /g/g15/brown338/Software/parthenon/build/tst/regression/outputs/advection_convergence_mpi
41: 
41: Make output folder in test if does not exist
41: *****************************************************************
41: Preparing Test Case Step 1
41: *****************************************************************
41: 
41: *****************************************************************
41: Running Driver
41: *****************************************************************
41: 
41: Command to execute driver
41: /usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338/Software/parthenon/build/example/advection/advection-example -i /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence/parthinput.advection parthenon/mesh/nx1=32 parthenon/meshblock/nx1=32 parthenon/mesh/nx2=1 parthenon/meshblock/nx2=1 parthenon/mesh/nx3=1 parthenon/meshblock/nx3=1 Advection/vy=0.0 Advection/vz=0.0 --kokkos-num-devices=4 --kokkos-threads=1
41: 
41: *****************************************************************
41: Subprocess error message
41: *****************************************************************
41: 
41: b'### PARTHENON ERROR
41:   Message:     ### FATAL ERROR in Mesh constructor
41: Too few mesh blocks: nbtotal (1) < nranks (4)
41: 
41:   File:        ../src/mesh/mesh.cpp
41:   Line number: 450
41: ### PARTHENON ERROR
41:   Message:     ### FATAL ERROR in Mesh constructor
41: Too few mesh blocks: nbtotal (1) < nranks (4)
41: 
41:   File:        ../src/mesh/mesh.cpp
41:   Line number: 450
41: ### PARTHENON ERROR
41:   Message:     ### FATAL ERROR in Mesh constructor
41: Too few mesh blocks: nbtotal (1) < nranks (4)
41: 
41:   File:        ../src/mesh/mesh.cpp
41:   Line number: 450
41: ### PARTHENON ERROR
41:   Message:     ### FATAL ERROR in Mesh constructor
41: Too few mesh blocks: nbtotal (1) < nranks (4)
41: 
41:   File:        ../src/mesh/mesh.cpp
41:   Line number: 450
41: '
41: 
41: *****************************************************************
41: Error detected while running subprocess command
41: *****************************************************************
41: 
41: Traceback (most recent call last):
41:   File "/g/g15/brown338/Software/parthenon/tst/regression/utils/test_case.py", line 235, in Run
41:     proc = subprocess.run(run_command, check=True, stdout=PIPE, stderr=PIPE)
41:   File "/usr/gapps/parthenon_shared/parthenon-project/views/rzansel/ppc64le/gcc8/2021-01-04/lib/python3.8/subprocess.py", line 512, in run
41:     raise CalledProcessError(retcode, process.args,
41: subprocess.CalledProcessError: Command '['/usr/tcetmp/bin/jsrun', '-a', '4', '-c', '1', '-n', '1', '-g', '4', '-r', '1', '-d', 'packed', "--smpiargs='-gpu'", '/g/g15/brown338/Software/parthenon/build/example/advection/advection-example', '-i', '/g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence/parthinput.advection', 'parthenon/mesh/nx1=32', 'parthenon/meshblock/nx1=32', 'parthenon/mesh/nx2=1', 'parthenon/meshblock/nx2=1', 'parthenon/mesh/nx3=1', 'parthenon/meshblock/nx3=1', 'Advection/vy=0.0', 'Advection/vz=0.0', '--kokkos-num-devices=4', '--kokkos-threads=1']' returned non-zero exit status 134.
41: 
41: During handling of the above exception, another exception occurred:
41: 
41: Traceback (most recent call last):
41:   File "/g/g15/brown338/Software/parthenon/tst/regression/run_test.py", line 152, in <module>
41:     main(**vars(args))
41:   File "/g/g15/brown338/Software/parthenon/tst/regression/run_test.py", line 76, in main
41:     test_manager.Run()
41:   File "/g/g15/brown338/Software/parthenon/tst/regression/utils/test_case.py", line 245, in Run
41:     raise TestManagerError('\nReturn code {0} from command \'{1}\''
41: utils.test_case.TestManagerError: 
41: Return code 134 from command '/usr/tcetmp/bin/jsrun -a 4 -c 1 -n 1 -g 4 -r 1 -d packed --smpiargs='-gpu' /g/g15/brown338/Software/parthenon/build/example/advection/advection-example -i /g/g15/brown338/Software/parthenon/tst/regression/test_suites/advection_convergence/parthinput.advection parthenon/mesh/nx1=32 parthenon/meshblock/nx1=32 parthenon/mesh/nx2=1 parthenon/meshblock/nx2=1 parthenon/mesh/nx3=1 parthenon/meshblock/nx3=1 Advection/vy=0.0 Advection/vz=0.0 --kokkos-num-devices=4 --kokkos-threads=1'
1/1 Test #41: regression_mpi_test:advection_convergence ...***Failed   68.48 sec

@pgrete (Collaborator) commented Feb 1, 2021

Yes, I also noticed that while updating/testing on Summit following my previous comment.
In that process I also noticed that the advection_performance test should, in fact, run with more MPI processes, as I originally added logic to make the MeshBlocks smaller based on the number of MPI processes involved.
That now makes me believe that the logic behind parameters.num_ranks is broken, i.e., it may always be 1.
This led me to test_case.py, where

        argstrings = ['-np','-n']
        if len(set(argstrings) & set(self.parameters.mpi_opts)) > 1:
          print('Warning! You have set both "-n" and "-np" in your MPI options.')
          print(self.parameters.mpi_opts)
        for s in argstrings:
          if s in self.parameters.mpi_opts:
            index = self.parameters.mpi_opts.index(s)
            if index < len(self.parameters.mpi_opts) - 1:
              try:
                self.parameters.num_ranks = int(self.parameters.mpi_opts[index+1])
              except ValueError:
                pass

so I now wonder if that logic still works (in general, and specifically for the jsrun-based parameters used on Summit).
I'm going to call it a day, but if you'd like to dig deeper today, my best bet is somewhere around these pieces.
Otherwise, I'll dig deeper tomorrow.

Bottom line: there's definitely a bug somewhere that resulted in things not working as we expected.
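
For reference, in the logs above the options arrive as mpirun_opts=['-a', '4', "-c 1 -n 1 -g 1 ..."], so '-n' never appears as a standalone list element and the loop in the excerpt never updates num_ranks. Below is a hypothetical sketch (not a proposed patch) of that failure mode plus one possible workaround that tokenizes the options first; note that for jsrun the total rank count is the number of resource sets (-n) times the tasks per resource set (-a), so reading '-n' alone would still not give the MPI rank count:

    import shlex

    # Options exactly as they appear in the failing test output above.
    mpi_opts = ['-a', '4', "-c 1 -n 1 -g 1 -r 1 -d packed --smpiargs='-gpu'"]

    # The existing detection looks for '-n'/'-np' as whole list elements,
    # so it misses the '-n' buried inside the combined string:
    print('-n' in mpi_opts)  # False -> num_ranks keeps its default value

    # Hypothetical workaround: flatten the options into individual tokens first.
    tokens = [tok for opt in mpi_opts for tok in shlex.split(opt)]
    # ['-a', '4', '-c', '1', '-n', '1', '-g', '1', '-r', '1', '-d', 'packed', '--smpiargs=-gpu']

    # For jsrun, total MPI ranks = resource sets (-n) * tasks per resource set (-a).
    nrs = int(tokens[tokens.index('-n') + 1]) if '-n' in tokens else 1
    tasks_per_rs = int(tokens[tokens.index('-a') + 1]) if '-a' in tokens else 1
    print(nrs * tasks_per_rs)  # -> 4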

@JoshuaSBrown (Collaborator Author)

Alright, this should be good to go.

@JoshuaSBrown (Collaborator Author)

All tests pass.

@JoshuaSBrown JoshuaSBrown enabled auto-merge (squash) February 3, 2021 21:12
@jlippuner (Collaborator) left a comment

Nice work! I just tested it on RZAnsel and it seems to work as expected.

I made a few minor comments/suggestions but nothing to hold up merging.

#### Allocate Node

[RZAnsel](https://hpc.llnl.gov/hardware/platforms/rzansel) is a homogeneous cluster consisting of 2,376 nodes with the IBM Power9
architecture with 44 nodes per core and 4 Nvidia Volta GPUs per node. To
Collaborator

Suggested change:
- architecture with 44 nodes per core and 4 Nvidia Volta GPUs per node. To
+ architecture with 44 cores per node and 4 Nvidia Volta GPUs per node. To

$ lalloc 1
```

#### Set-Up Environment (Optional, but Still Recommended, for Non-CUDA Builds)
Collaborator

I am confused by this (also for the Darwin instructions).

Is this whole set of instructions only for non-CUDA builds? If so, what are the instructions for CUDA builds? Or is it optional (but still recommended) for non-CUDA builds and required for CUDA builds?

Collaborator

Alright, the last sentence answers this question, I think.

Maybe we should call this section "Set-Up Environment (required for CUDA builds, optional but recommended for non-CUDA builds)".

Collaborator Author

@AndrewGaspar, @jlippuner raises a good point: why isn't the build configuration simply optional? As far as I can tell, there is nothing in there but links to ninja, cmake, the compiler, and git. Is it because of the dependence of CUDA on the compiler?

@@ -385,6 +385,69 @@ Once you've configured your build directory, you can build with
LANL Employees - to understand how the project space is built out, see
https://re-git.lanl.gov/eap-oss/parthenon-project

### LNLL RZAnsel (Homogeneous)

Last verified 04 Jan 2021.
Collaborator

Is this up-to-date?

Collaborator Author

I'll fix these in a separate PR then.

@JoshuaSBrown JoshuaSBrown merged commit 0d96f15 into develop Feb 3, 2021
@Yurlungur Yurlungur deleted the JoshuaSBrown/setup-rzansel-machine-config branch February 9, 2021 17:07