Initialize HIP before MPI for older Cray MPICH versions #1090
cscs-ci run
LGTM.
FYI, the same workaround was added in CP2K a while back (suggested by @msimberg): cp2k/cp2k#3121
@@ -26,6 +31,12 @@ namespace comm {
struct mpi_init {
/// Initialize MPI to MPI_THREAD_MULTIPLE
mpi_init(int argc, char** argv) noexcept {
// On older Cray MPICH versions initializing HIP after MPI leads to HIP not seeing any devices. Hence
Do you know which version is ok and which is problematic?
It would be nice to put it in the comment.
I wish I knew. To be honest, now that I have checked, it may not even be version dependent. I went back and checked LUMI and clariden, and both have libfabric 1.15.2.0 and mpich 8.1.25. LUMI claims to default to 8.1.27 because of some issue (https://lumi-supercomputer.eu/lumi-service-status/information-lumi-unavailable-due-to-hardware-installations-and-software-upgrades-starting-on-20-october-until-6-november/), but I still see 8.1.25 linked on LUMI (maybe my environment is outdated).
The comment should maybe just say something more generic, like "some Cray systems do this...".
I should've done this a long time ago, but was reminded again when running benchmarks on clariden. On LUMI Cray seems to have fixed the issue of MPI init making HIP devices disappear. If HIP is initialized before MPI, everything seems to work normally. On clariden this is still an issue. This puts a `hipInit` into the `mpi_init` struct, which makes it effective in all the miniapps. If needed we can consider doing the same for tests with #982.
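For illustration, the ordering described above could be sketched roughly as follows. This is not the code from this PR: `mpi_init_sketch`, `call_log`, and the `SKETCH_WITH_HIP_AND_MPI` macro are hypothetical names made up for the example. Without the macro the real `hipInit`, `MPI_Init_thread`, and `MPI_Finalize` calls are skipped and only the call order is recorded, so the ordering can be checked on a machine without a GPU or MPI.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical switch: define it to compile against the real HIP and MPI headers.
#if defined(SKETCH_WITH_HIP_AND_MPI)
#include <hip/hip_runtime.h>
#include <mpi.h>
#endif

// Records the order of initialization calls (illustration only).
inline std::vector<std::string>& call_log() {
  static std::vector<std::string> log;
  return log;
}

inline void init_hip() {
#if defined(SKETCH_WITH_HIP_AND_MPI)
  // Initialize the HIP runtime explicitly; the flags argument must be 0.
  hipError_t err = hipInit(0);
  assert(err == hipSuccess);
  (void) err;
#endif
  call_log().push_back("hipInit");
}

inline void init_mpi(int& argc, char**& argv) {
#if defined(SKETCH_WITH_HIP_AND_MPI)
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
#else
  (void) argc;
  (void) argv;
#endif
  call_log().push_back("MPI_Init_thread");
}

inline void finalize_mpi() {
#if defined(SKETCH_WITH_HIP_AND_MPI)
  MPI_Finalize();
#endif
  call_log().push_back("MPI_Finalize");
}

// RAII guard: HIP is initialized first, then MPI, so that HIP still sees
// its devices on systems where MPI initialization would otherwise hide them.
struct mpi_init_sketch {
  mpi_init_sketch(int argc, char** argv) noexcept {
    init_hip();
    init_mpi(argc, argv);
  }
  ~mpi_init_sketch() noexcept {
    finalize_mpi();
  }
  mpi_init_sketch(const mpi_init_sketch&) = delete;
  mpi_init_sketch& operator=(const mpi_init_sketch&) = delete;
};
```

A miniapp would construct one such guard at the top of `main`, before any other HIP or MPI call, and let it go out of scope at the end of the program.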