Add machine config file for RZAnsel #377
Conversation
Waiting for https://re-git.lanl.gov/eap-oss/parthenon-project/-/merge_requests/2 to be merged first.
This is now ready for review with the exception of these lines:
I will fix this once the parthenon-project repo is updated and the update is available on RZAnsel.
Ok, this is good to go.
@pgrete I could use some help with this: either there is some logic in parthenon that needs to be corrected, or the tests simply are not configured to run with 4 GPUs and 4 MPI procs. This first scenario should be the one that works: I can verify that nvidia-smi indicates 4 tasks are launched.
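A minimal sketch (not part of this PR) of scripting that nvidia-smi check instead of eyeballing it, assuming nvidia-smi is on PATH and using its standard `--query-compute-apps` interface:

```python
import subprocess

# Count how many distinct GPUs currently have compute processes.
# Each line of output is "<gpu_uuid>, <pid>"; one UUID per busy GPU.
out = subprocess.run(
    ["nvidia-smi", "--query-compute-apps=gpu_uuid,pid", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
gpus_in_use = {line.split(",")[0].strip() for line in out.splitlines() if line.strip()}
print(f"{len(gpus_in_use)} GPU(s) have active compute processes (expected 4)")
```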
@jonahm-LANL I could use your input here as well.
@pgrete Yeah, that's all well and good. We should definitely ensure we're holding the number of GPUs constant when we're doing performance testing, which you can control using
Sorry, my previous comment was probably not clear. It may be worth adding a safety check (i.e., ranks == 1) to that test, similar to the advection_convergence test:

```python
# make sure we can evenly distribute the MeshBlock sizes
err_msg = "Num ranks must be multiples of 2 for convergence test."
assert parameters.num_ranks == 1 or parameters.num_ranks % 2 == 0, err_msg
# ensure a minimum block size of 4
assert lin_res[0] / parameters.num_ranks >= 4, "Use <= 8 ranks for convergence test."
```
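For the GPU test discussed above, the analogous guard might look like the sketch below; the hard-coded count of 4 (one rank per GPU on an RZAnsel node) is my assumption, and `parameters` follows the snippet above:

```python
# Hypothetical guard for a 4-GPU test (the rank count is assumed, not prescribed by the PR).
err_msg = "GPU test expects one rank per GPU, i.e. 4 ranks on an RZAnsel node."
assert parameters.num_ranks == 4, err_msg
```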
Thanks everyone for the feedback. @pgrete, do you want to take a last look over?
Ok, well I'm thoroughly confused, because when I run the convergence tests with 4 ranks and 4 GPUs, it is still only making use of a single GPU. I'm also pretty sure I have the correct command, because the restart test will actually utilize all four GPUs with the same command.
Let me double check in practice. At least on first sight I don't see a difference in the parameter/CMake files that may cause that behavior.
Gives me:

But with advection convergence:
Yes, I also noticed that while updating/testing on Summit following my previous comment:

```python
argstrings = ['-np', '-n']
if len(set(argstrings) & set(self.parameters.mpi_opts)) > 1:
    print('Warning! You have set both "-n" and "-np" in your MPI options.')
    print(self.parameters.mpi_opts)
for s in argstrings:
    if s in self.parameters.mpi_opts:
        index = self.parameters.mpi_opts.index(s)
        if index < len(self.parameters.mpi_opts) - 1:
            try:
                self.parameters.num_ranks = int(self.parameters.mpi_opts[index + 1])
            except ValueError:
                pass
```

so I now wonder if that logic still works (in general and specifically for the

Bottom line: there's definitely a bug somewhere that resulted in things not working as we expected them to.
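For reference, a hedged sketch of a stricter version of that parsing logic; the fused "-n4" / "--np=4" forms are assumptions about what launchers might pass, not behavior the current driver claims to handle:

```python
import re

def parse_num_ranks(mpi_opts):
    """Return the rank count implied by '-n'/'-np'/'--np' options, or None."""
    pattern = re.compile(r"^(-n|-np|--np)(?:=?(\d+))?$")
    for i, tok in enumerate(mpi_opts):
        m = pattern.match(tok)
        if not m:
            continue
        if m.group(2):  # fused form: "-n4" or "--np=4"
            return int(m.group(2))
        if i + 1 < len(mpi_opts) and mpi_opts[i + 1].isdigit():
            return int(mpi_opts[i + 1])  # separated form: "-n 4"
    return None
```

Failing loudly when no rank count can be inferred (instead of silently keeping a default) might have surfaced this bug earlier.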
Alright, this should be good to go.
All tests pass.
Nice work! I just tested it on RZAnsel and it seems to work as expected.
I made a few minor comments/suggestions but nothing to hold up merging.
#### Allocate Node

[RZAnsel](https://hpc.llnl.gov/hardware/platforms/rzansel) is a homogeneous cluster consisting of 2,376 nodes with the IBM Power9
architecture with 44 nodes per core and 4 Nvidia Volta GPUs per node. To
Suggested change:

```diff
-architecture with 44 nodes per core and 4 Nvidia Volta GPUs per node. To
+architecture with 44 cores per node and 4 Nvidia Volta GPUs per node. To
```
```
$ lalloc 1
```
#### Set-Up Environment (Optional, but Still Recommended, for Non-CUDA Builds) |
I am confused by this (also for the Darwin instructions).
Are these whole instructions for non-CUDA builds? If so, what are the instructions for CUDA builds? Or is it only optional (but still recommended) for non-CUDA builds but required for CUDA builds?
Alright, the last sentence answers this question, I think.
Maybe we should call this section `Set-Up Environment (required for CUDA builds, optional (but recommended) for non-CUDA builds)`.
@AndrewGaspar, @jlippuner raises a good point: why isn't the build configuration simply optional? As far as I can tell there is nothing in there but links to ninja, cmake, the compiler, and git. Is it because of the dependence of CUDA on the compiler?
@@ -385,6 +385,69 @@ Once you've configured your build directory, you can build with
LANL Employees - to understand how the project space is built out, see
https://re-git.lanl.gov/eap-oss/parthenon-project

### LLNL RZAnsel (Homogeneous)

Last verified 04 Jan 2021.
Is this up-to-date?
I'll fix these in a separate PR then.
PR Summary
This adds a machine config file for RZAnsel, which should make it easier for users to get up and running with a working build. Using the ideas that @pgrete had in the Summit config file, three variants can be built: "cuda", "mpi", and "cuda+mpi".
Note that I have not removed the old documentation for building on RZAnsel because it was erroneously adapted from the Summit documentation. As per my conversation with @pgrete, he was going to fix it, so I did not want to touch it.
PR Checklist