MPI.Init on ARCHER2 either segfaults or hangs most of the time on multi-node jobs #623

Closed
giordano opened this issue Sep 9, 2022 · 5 comments

Comments

@giordano
Member

giordano commented Sep 9, 2022

On ARCHER2 (UK Tier-1 system), I'm observing that calling MPI.Init with the system's Cray MPICH (version string "MPI VERSION : CRAY MPICH version 8.1.4.31 (ANL base 3.4a2)\nMPI BUILD INFO : Thu Mar 18 17:07 2021 (git hash 3e74f0c)\n") either segfaults or hangs most of the time when running multi-node jobs. Note: this happens only on master of MPI.jl, not on v0.19.2, so it would appear that something is wrong with MPI.jl#master.

I don't have much time to investigate this further at the moment; I'm opening this issue as a reminder to try and look into it at some point.

giordano added this to the 0.20 milestone on Sep 9, 2022
@simonbyrne
Member

Bisection time!

@giordano
Member Author

I'm utterly confused: hangs/segfaults happen if I install MPI.jl with ]add MPI#4a87d7402ac3baba5cc97bfd8d5bd4cfbb825525, but not if I ]dev MPI and check out the same revision 😐 My attempt to git bisect the issue failed badly, because everything works when the package is dev'ed. This makes extremely little sense to me.

@simonbyrne
Member

Is this related at all to #616?

@giordano
Member Author

I forgot about that one, thanks for pointing it out. I'm not sure: ]add MPI works fine for me, and this installs 0.19.2, which seems to be the version used in #616 (at least, they mention the JULIA_MPI_* environment variables). It's only when I do ]add MPI#4a87d7402ac3baba5cc97bfd8d5bd4cfbb825525 (but not ]dev MPI) that I get the segfaults or, more often, hangs. I know this sounds absurd; I'm also at a loss here.

@giordano
Member Author

I'm going to close the ticket, as I can't reproduce the issue anymore with ]add MPI#4cd71180218a5bb2e88a7c153ef8538dbb7ea74e. Sigh.

giordano closed this as not planned on Sep 19, 2022
giordano changed the title from "MPI.Init on ACHER2 either segfaults or hangs most of the time on multi-node jobs" to "MPI.Init on ArCHER2 either segfaults or hangs most of the time on multi-node jobs" on Oct 10, 2022
giordano changed the title from "MPI.Init on ArCHER2 either segfaults or hangs most of the time on multi-node jobs" to "MPI.Init on ARCHER2 either segfaults or hangs most of the time on multi-node jobs" on Oct 10, 2022