[mpi4py] Regression with ompi@main #12195
@rhc54 I'm also getting this output at the beginning of my runs:
[warning output omitted]
I also see similar warnings in my local run on the main branch.
I see how to disable that warning - Linux does something that isn't POSIX compliant, but I can cover it. I'm afraid there isn't enough information about the primary error being reported here for me to do anything about it.
Here's the warning fix: openpmix/openpmix#3252
Many thanks, Ralph.
Any chance that the error was actually related to the fix you pushed?
Can't say - you haven't yet told me what the error is 😉
The error is in the description of this issue: "An unrecoverable error occurred in the gds/shmem component. ..." It looks like this happens while trying to construct an intercommunicator. It happens when running with oversubscription (GitHub Actions runners have 2 virtual cores, and I'm running with 3 MPI processes). Maybe oversubscription has nothing to do with the error, and the issue triggers just because I'm using a small odd number of processes, or some corner case like that. Unfortunately, I'm not able to run this locally right now.
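For context, here is a minimal sketch of the spawn-then-intercommunicator pattern being described, assuming the failing test follows this general shape; the file name, message contents, and process counts below are illustrative and not taken from the actual mpi4py test suite:

```python
# Hypothetical reproducer sketch (not the actual mpi4py test): one parent
# spawns 3 workers, oversubscribing a 2-core runner, then exchanges a message
# over the intercommunicator returned by Spawn.
# Run as: mpiexec -n 1 python reproducer.py
import sys

from mpi4py import MPI


def parent():
    # Spawn returns an intercommunicator connecting the parent to the children.
    intercomm = MPI.COMM_SELF.Spawn(
        sys.executable, args=[__file__, "worker"], maxprocs=3)
    # On an intercommunicator, the broadcasting process passes root=MPI.ROOT.
    intercomm.bcast("ping", root=MPI.ROOT)
    intercomm.Disconnect()


def worker():
    # Children retrieve the intercommunicator to their parent.
    intercomm = MPI.Comm.Get_parent()
    msg = intercomm.bcast(None, root=0)  # root is rank 0 of the parent group
    assert msg == "ping"
    intercomm.Disconnect()


if __name__ == "__main__":
    worker() if "worker" in sys.argv else parent()
```

With a 2-core runner, spawning 3 workers from a single parent oversubscribes the node; whether the launcher allows that by default depends on its mapping policy, which may be part of why this only shows up in CI.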
Okay, I wasn't sure if that was the error you were talking about. Hard to figure what could cause that as the procs would all have been started before the intercomm create. What does your test actually do - have a proc that spawns more procs and then tries to create a communicator between them? Is the error coming from one of the spawned procs - or all of them? Once I know what your test does and where the error comes from, I can try to replicate it here. I doubt it has anything to do with the oversubscription.
I see where the error is coming out and why - what I don't understand is what is triggering it. I'm not able to reproduce the problem here when running a simple test that does a spawn and then creates the intercommunicator. It might help to also know what environment your CI is running in. I could disable the shmem component or revert the commit that added the code behind the error report. However, the person who supports that component may be out this week and I'd prefer to get his input on it. We have a meeting Thursday - I'll raise it with him if he attends.
The failing test may actually be one involving [inline code reference lost]. If you can read some slightly convoluted Python, the test code is split in two hunks, here and here. PS: This particular test has had regressions at least two or three times in the past after PMIx submodule pointer updates. It looks like the Open MPI folks should add a test to their suite to prevent this issue from resurfacing over and over.
GitHub Actions with GitHub-hosted runners running their Ubuntu 22.04 image.
Yes, better to wait for the expert and get to the bottom of it.
@hppritcha @samuelkgutierrez This may be hard to reproduce as it sounds like quite a convoluted process. Since it might have something to do with PMIx (it isn't immediately clear to me how PMIx is involved here, nor why gds/shmem would impact it), it would be good if we could distill from the test the specific PMIx calls (and their order) being made so we could create a PMIx-only model of it.
What worked for me in the past is to bisect on the suspect repo and locate the problematic commit first. The problem will be more obvious if we can narrow it down to a few lines.
I already know the commit that is involved here - sorry, I wasn't complete in my answer. What isn't clear is why that commit is triggered by this collage of MPI calls in this CI. Is it the environment? Is it some combination of calls that triggers it? Or something else? I just need to understand the parameters of the problem. It would help if we could extract the MPI from the test and run it elsewhere (i.e., not in GitHub Actions and/or Ubuntu 22.04) to see if it reproduces.
I'll take a look at this later today or tomorrow.
Reproduced on an aarch64 RHEL8 system:
[output omitted]
Here's what the PMIx GDS has to say about this:
[output omitted]
This one looks like it is mine. Can someone please assign it to me?
Okay, I'm writing up a GitHub Action to run mpi4py tests with every PR.
Referenced commit: seems like mpi4py finds a problem almost every time we advance openpmix/prrte SHAs, so catch it early here. We test mpi4py master as it contains the MPI-4 stuff that so often breaks. Related to open-mpi#12195. Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Thanks again for catching this, @dalcinl. This issue should be fixed in OpenPMIx now. Once the submodule pointers are updated in Open MPI, this regression should be resolved.
Fixes all merged in. Closing.
https://github.com/mpi4py/mpi4py-testing/actions/runs/7294717237/job/19880980395