Fix MPI communicator leak #841
Conversation
This is "finished" but not yet tested to confirm it works. When I get confirmation that it fixes the issue downstream, I'll un-WIP it.
@par-hermes format
Thanks for the fix!
This is another one that we ran into on the Riot branch of Parthenon that never got merged back into main... sorry. I used a pretty similar strategy, but without the smart pointer.
Edit: Oh, I didn't see that @bprather commented on this already.
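For context, here is a rough sketch of that destructor-based approach, reconstructed from the description at the bottom of this PR rather than from the actual Riot-branch code (the constructor body, the template parameter, and the use of PARTHENON_MPI_CHECK are assumptions):

```cpp
// Sketch of the earlier, smart-pointer-free fix: free the communicator in a destructor.
// Caveat (discussed further down in this thread): every copy of the object holds the
// same MPI_Comm handle, so each copy's destructor calls MPI_Comm_free on that handle,
// which is only safe if the MPI implementation ref-counts communicators.
template <typename T>
struct ReductionBase {
  T val;
  MPI_Request req;
  MPI_Comm comm;

  ReductionBase() { PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, &comm)); }
  ~ReductionBase() { MPI_Comm_free(&comm); }
};
```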
src/utils/reductions.hpp
Outdated
MPI_Comm *comm_tmp;
PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm_tmp));
comm = std::shared_ptr<MPI_Comm>(comm_tmp, MPI_Comm_free);
Looks like this causes a runtime error, probably because MPI_Comm_dup doesn't allocate storage for the handle (comm_tmp is left uninitialized). I suggest the following:
- MPI_Comm *comm_tmp;
- PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm_tmp));
- comm = std::shared_ptr<MPI_Comm>(comm_tmp, MPI_Comm_free);
+ comm = std::shared_ptr<MPI_Comm>(new MPI_Comm, MPI_Comm_free);
+ PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, comm.get()));
With this setup we could also put the initialization of comm in the variable definition, e.g.,
std::shared_ptr<MPI_Comm> comm(new MPI_Comm, MPI_Comm_free);
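To spell out the ownership semantics being proposed here, a minimal standalone sketch (plain MPI, outside Parthenon; PARTHENON_MPI_CHECK and error handling omitted): copies of the shared_ptr share the duplicated communicator, and the deleter runs exactly once, when the last owner goes away.

```cpp
#include <memory>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  {
    // Allocate storage for the handle, register MPI_Comm_free as the deleter,
    // then duplicate MPI_COMM_WORLD into the storage the shared_ptr owns.
    std::shared_ptr<MPI_Comm> comm(new MPI_Comm, MPI_Comm_free);
    MPI_Comm_dup(MPI_COMM_WORLD, comm.get());

    std::shared_ptr<MPI_Comm> copy = comm;  // copies share the same communicator
    int rank = -1;
    MPI_Comm_rank(*copy, &rank);
  }  // last owner destroyed here -> MPI_Comm_free is called exactly once
  MPI_Finalize();
  return 0;
}
```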
src/utils/reductions.hpp
Outdated
@@ -59,20 +59,26 @@ struct ReductionBase {
   T val;
 #ifdef MPI_PARALLEL
   MPI_Request req;
-  MPI_Comm comm;
+  std::shared_ptr<MPI_Comm> comm;
Since this is a pointer now, it might also make sense to rename it pcomm for clarity.
This fix does not work for me on Crusher.
Ok, with the suggested changes from @Yurlungur, it works.
We also need this fix for AthenaPK.
…er before finalize
Resolve conflict when fixing the same bug
Sorry I'm a bit late on this. This wasn't passing tests initially, and I found I needed to scope the Driver object so it is destroyed before finalization.
I think more people may get caught by this issue, though: we might want to establish that all Parthenon objects be set off in a code block inside main().
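As an illustration of the scoping convention being proposed, a rough sketch of what that could look like. The ParthenonManager calls and the MyDriver name here are assumptions modeled on typical Parthenon example drivers, not code from this PR:

```cpp
#include "parthenon_manager.hpp"

int main(int argc, char *argv[]) {
  parthenon::ParthenonManager pman;
  pman.ParthenonInit(argc, argv);  // MPI_Init happens in here
  {
    // Everything that owns MPI resources (the driver and any reducers it holds)
    // is confined to this block, so destructors run before finalization below.
    MyDriver driver(pman.pinput.get(), pman.app_input.get(), pman.pmesh.get());
    driver.Execute();
  }  // driver destroyed here, while MPI is still initialized
  pman.ParthenonFinalize();  // MPI_Finalize happens in here
  return 0;
}
```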
Please don't forget to update the CHANGELOG, @bprather :)
Approved assuming tests pass.
Ok, great. Actually it was @pgrete who discovered the finalization issue :)
Nice clean solution!
It looks like the Poisson solver regression test fails.
Hmm, this is exactly the error and location I was seeing before adopting the Driver scoping fix, but I no longer see it on MPICH. What MPI stack do the tests use? Maybe OpenMPI, for example, holds on to comm handles more tightly? Is there a way to access the full CTest log for the automated tests? That would hopefully confirm it's the same error and provide any implementation-specific context.
I see what looks like the full log here: https://github.com/parthenon-hpc-lab/parthenon/actions/runs/4469416747/jobs/7851492978?pr=841#step:8:10205
The configure step says it's using OpenMPI.
Ah, I should have checked the compile output, thanks. I'll try to reproduce this by switching to OpenMPI on my test setup. Re: the log, I saw that much; I think there is usually an artifact with the full test output as well.
It would be useful to have the stderr output. It looks like the artifact upload is canceled if the tests fail, but this can be changed: https://docs.github.com/en/actions/learn-github-actions/expressions#always
Okay, here's the situation as near as I can tell. I was able to reproduce the issue when compiling inside the CI container.
Rather than MPI, the key difference appears to be the version of HDF5, with 1.12.2 specifically causing the issue (I believe even 1.12.1 is not affected, but I'm not sure of this yet). This appears to be due to an attempted MPI call from inside HDF5 when writing output.
Still testing different stacks (e.g., not sure if HDF5 1.12.2/MPICH exhibits this issue), but that's the basic idea. So it'll be a little while to fix, I think :/
Wow, that's incredibly annoying of HDF5. Can we switch the CI container to use 1.12.1, or do we have to use 1.12.2 for some other reason?
Why exactly does 1.12.2 break things? I don't think we're tied specifically to 1.12.2, but it would be good to know what's going on here.
This is a cool bug, which is to say it's driving me insane. It is a race condition, in that it only appears when running under MPI with > 1 rank, and then only randomly, which is a delight to test for! FWIW I've never actually triggered the bug with 2 ranks, either, only 4.
It does not appear to be related to the HDF5 or OpenMPI version, in that I can reproduce it with varying frequencies on every combination I've tried (I haven't tested on MPICH, since "don't use OpenMPI" is not a valid workaround). Sprinkling barriers on all MPI communicators and waits between operations does not prevent it.
Initially, I had thought this might be a pool/allocator issue: MPI frees "Comm1," HDF5 asks for a communicator, MPI gives it the memory that backed "Comm1" as "Comm2." Then, MPI sees that "Comm1" is unused, frees the backing memory, and thereby pulls the rug out from under "Comm2." However, since we can wait 5s from the last MPI call and still hit the race condition, I'm now doubtful this is the case. Worse, I have no theory to replace it. The fact is, if any communicator has been freed during the run, I can wait a seemingly arbitrary amount of time, and the next output can still hit it.
@bprather out of curiosity, does @lroberts36's version, which just added a destructor to the reducers, fix the issue? If so, maybe that's just the easiest solution?
I strongly believe it will not, because a plain destructor ends up calling MPI_Comm_free once per copy of the reducer, and not every implementation tolerates that.
According to the standard, each communicator should be reference counted between processes, with a shared counter incremented on duplication. Calling MPI_Comm_free marks the communicator for deallocation: pending operations on it still complete, and the object is only torn down once nothing references it. MPI also provides a synchronous version of this free call.
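To make the deferred-deallocation semantics concrete, here is a small standalone sketch (plain MPI, not Parthenon code) of what the standard promises: a pending nonblocking operation on a communicator still completes after MPI_Comm_free has been called on it.

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  MPI_Comm dup;
  MPI_Comm_dup(MPI_COMM_WORLD, &dup);

  int send = 1, recv = 0;
  MPI_Request req;
  MPI_Iallreduce(&send, &recv, 1, MPI_INT, MPI_SUM, dup, &req);

  // Marks the communicator for deallocation and sets dup to MPI_COMM_NULL;
  // the pending reduction above is still required to complete normally.
  MPI_Comm_free(&dup);

  MPI_Wait(&req, MPI_STATUS_IGNORE);
  std::printf("sum = %d\n", recv);

  MPI_Finalize();
  return 0;
}
```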
Actually hey, I think I know what it is. @bprather, move the reducer objects in Poisson out of the driver and into mutable params. The problem is that the driver object persists beyond finalization.
I had this same issue with reducers over Kokkos views. That's why the views are in params, not the driver, in the first place.
I think what you're describing is actually the original issue I ran into, which printed something about MPI calls after finalization. I've verified that this issue is orthogonal to that one: it appears regardless of Driver scoping, when creating the last output files, which happens well before finalization.
Continuing to poke at this a bit, but probably no news before Monday.
After experimenting with this, I can find no reliable way of reproducing it on every invocation, even inside the testing container. However, adding MPI barriers alone, without the absurd waiting times between operations, actually made the race condition much less frequent. I found I could achieve the same stabilizing effect as the barriers (at least, on my machine) by switching out the offending call.
I make no claim that this solves the underlying problem, but it does make the error so infrequent on my machine (5% of invocations? Less?) that I can't pursue it much further without a more reliable way to trigger the error at all. I'd recommend as many people as possible try this fix: either it works for all cases, or someone finds a way to more reliably reproduce the race condition here.
Does this change resolve the issue on the CI? If so, that's probably good enough.
A manual CI run passed, so I've toggled auto-merge to trigger everything with actual reporting. I guess: speak now, or etc., etc.
This fixes an issue seen in long runs which perform many reductions, where, depending on the MPI implementation, the program might run out of available file handles, or exhaust host memory entirely. This was due to the allocation of an MPI communicator via MPI_Comm_dup upon creation of a Reduce or AllReduce object, without an accompanying call to MPI_Comm_free when an object and/or its copies were deleted.
This issue was previously fixed in the Riot branch by adding a destructor to ReductionBase, which called MPI_Comm_free. However, that fix relies on the MPI implementation to ref-count communicators such that calling MPI_Comm_free twice on a communicator would not result in a double-free error. It appears not all implementations do this, at least in my experience using that fix in KHARMA.
This fix instead declares the member variable comm as a std::shared_ptr, with MPI_Comm_free registered as a custom deleter. It also adds some PARTHENON_MPI_CHECKs that were missing.
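For reference, a minimal sketch of the pattern described above, simplified (the MPI_PARALLEL guards are omitted) and based on the snippets quoted in the review rather than the merged file; the constructor body and the pcomm member name are assumptions.

```cpp
#include <memory>
#include <mpi.h>

template <typename T>
struct ReductionBase {
  T val;
  MPI_Request req;
  // Shared ownership of the duplicated communicator: MPI_Comm_free runs exactly
  // once, when the last ReductionBase (or copy) sharing it is destroyed. All such
  // objects must therefore go out of scope before MPI_Finalize is called.
  std::shared_ptr<MPI_Comm> pcomm;

  ReductionBase() : pcomm(new MPI_Comm, MPI_Comm_free) {
    PARTHENON_MPI_CHECK(MPI_Comm_dup(MPI_COMM_WORLD, pcomm.get()));
  }
};
```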