Initialize HIP before MPI for older Cray MPICH versions #1090

Merged: 1 commit, Feb 6, 2024

@msimberg (Collaborator) commented Feb 1, 2024

I should've done this a long time ago, but was reminded again when running benchmarks on clariden. On LUMI, Cray seems to have fixed the issue of MPI initialization making HIP devices disappear, but on clariden it is still an issue. If HIP is initialized before MPI, everything seems to work normally. This PR adds a hipInit call to the mpi_init struct, which makes the workaround effective in all the miniapps. If needed, we can consider doing the same for tests with #982.
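
The ordering described above can be sketched roughly as follows. This is an illustration only, not the code merged in this PR. The names `ordered_init`, `init_log`, `fake_hip_init`, `fake_mpi_init` and `fake_mpi_finalize` are hypothetical stand-ins, used so that the sketch compiles without HIP or MPI. In a real build they would presumably be replaced by `hipInit(0)`, `MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided)` and `MPI_Finalize()`. Finalizing MPI in the destructor is also an assumption of this sketch.

```cpp
#include <string>
#include <vector>

// Records the order in which the (stand-in) runtime calls happen.
inline std::vector<std::string>& init_log() {
  static std::vector<std::string> log;
  return log;
}

// Hypothetical stand-in for hipInit(0).
inline int fake_hip_init() {
  init_log().push_back("hip_init");
  return 0;  // 0 plays the role of hipSuccess
}

// Hypothetical stand-in for
// MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided).
inline int fake_mpi_init(int* /*argc*/, char*** /*argv*/) {
  init_log().push_back("mpi_init");
  return 0;  // 0 plays the role of MPI_SUCCESS
}

// Hypothetical stand-in for MPI_Finalize().
inline int fake_mpi_finalize() {
  init_log().push_back("mpi_finalize");
  return 0;
}

// RAII guard in the spirit of the mpi_init struct: the GPU runtime is
// initialized first, MPI second, and MPI is finalized on destruction.
struct ordered_init {
  ordered_init(int argc, char** argv) noexcept {
    // GPU runtime first, so that MPI initialization cannot hide the devices.
    fake_hip_init();
    fake_mpi_init(&argc, &argv);
  }
  ~ordered_init() noexcept { fake_mpi_finalize(); }

  // The guard owns process-wide state, so copying it makes no sense.
  ordered_init(const ordered_init&) = delete;
  ordered_init& operator=(const ordered_init&) = delete;
};
```

A miniapp would construct such a guard at the top of `main`, for example `ordered_init guard(argc, argv);`, before making any other HIP or MPI call.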

@msimberg msimberg self-assigned this Feb 1, 2024
@msimberg (Collaborator, Author) commented Feb 2, 2024

cscs-ci run

@RMeli (Member) left a comment

LGTM.

FYI, the same workaround was added in CP2K a while back (suggested by @msimberg): cp2k/cp2k#3121

@@ -26,6 +31,12 @@ namespace comm {
struct mpi_init {
/// Initialize MPI to MPI_THREAD_MULTIPLE
mpi_init(int argc, char** argv) noexcept {
// On older Cray MPICH versions initializing HIP after MPI leads to HIP not seeing any devices. Hence
Collaborator

Do you know which version is ok and which is problematic?
It would be nice to put it in the comment.

Collaborator Author

I wish I knew. To be honest, now that I have checked, it may not even be version dependent. I went back and checked LUMI and clariden, and both have libfabric 1.15.2.0 and mpich 8.1.25, though LUMI claims to default to 8.1.27 because of some issue (https://lumi-supercomputer.eu/lumi-service-status/information-lumi-unavailable-due-to-hardware-installations-and-software-upgrades-starting-on-20-october-until-6-november/). I still see 8.1.25 linked on LUMI, though (maybe my environment is outdated).

The comment should maybe just say something more generic, like "some Cray systems do this...".

@rasolca rasolca merged commit 8516c8e into eth-cscs:master Feb 6, 2024
4 checks passed
3 participants