Running opal_fifo test intermittently hangs on Power8 #5470
Comments
Verifying.
Verified. I hit the hang in iteration 4.
Tomorrow I'll debug.
I've verified.
Do you know if you can hit the problem on Power9?
I'll check. I hope that #5374 resolved this on Power9, but I didn't try the stress test.
I'll check on my Power9 system as well.
Power9 has run over 140 iterations successfully. I did not disable the builtin atomics when testing on Power9, so it looks like it's just a Power8 issue. Power8 does by default detect this:
I didn't have as much luck on my Power9 system. With
So your Power9 was hanging also?
Yes, Power9 hung too. I tried it on Ubuntu 18.04; I also tried it in an Alpine container hosted on Ubuntu 18.04 and hit the hang there too.
@mksully22 GitHub pro tip: use a line of three backticks to start and end verbatim sections so the text renders better (see https://help.github.com/articles/creating-and-highlighting-code-blocks/).
@gpaulsen Any progress?
Once we pull #5445 we should even be able to drop builtin atomics completely.
Removing the critical label since there is a workaround: disabling builtin atomics.
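For reference, a minimal sketch of that workaround (illustrative; the prefix path is an example, and --disable-builtin-atomics is assumed to be the negation of Open MPI's --enable-builtin-atomics configure switch):

```
# Rebuild Open MPI with the compiler's builtin atomics disabled so the
# hand-written atomic assembly is used instead.
./configure --prefix=$HOME/ompi-install --disable-builtin-atomics
make -j8 && make install
```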
This should be resolved by dropping back to the atomic assembly on Power machines in these PRs to v4.1 and v5: v4.1.x: #8708. See issue:
Thank you for taking the time to submit an issue!
Background information
Running the opal_fifo test intermittently hangs on Power8. Detailed debug info is provided below.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
Using the master branch at commit 92d8941.
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Please describe the system on which you are running
Details of the problem
Using the following script to exercise opal_fifo. The test case hangs intermittently; htop shows all 8 opal_fifo LWPs running at 100% CPU.
Start the opal_fifo stress script to reproduce:
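A minimal sketch of such a stress loop (illustrative only; the binary path assumes the test/class directory of an Open MPI build tree):

```
#!/bin/sh
# Run the opal_fifo class test repeatedly; print the iteration number so a
# hang can be matched to a specific iteration.
i=0
while true; do
    i=$((i + 1))
    echo "Iteration${i}"
    ./test/class/opal_fifo   # adjust the path to your build tree
done
```

With a loop like this, a hang shows up as an iteration that never completes while the opal_fifo threads spin at 100% CPU.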
Looking at the running processes/LWPs:
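For example (illustrative commands; <pid> is a placeholder for the opal_fifo process id):

```
ps -eLf | grep opal_fifo   # list the process and its LWPs (threads)
top -H -p <pid>            # per-thread CPU view; htop shows the same thing
```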
Using gdb to collect some info on where the LWPs are:
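This kind of information can be gathered by attaching gdb to the hanging process and dumping each thread's backtrace, roughly as follows (<pid> is a placeholder):

```
gdb -p <pid>                  # attach to the hanging opal_fifo process
(gdb) info threads            # list the LWPs and where each one is stopped
(gdb) thread apply all bt     # backtrace of every thread
(gdb) thread 3                # select an individual thread for a closer look
(gdb) detach
(gdb) quit
```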
Note: LWPs 1, 2, and 4-8 are all caught in this loop (I had gdb display some of the noteworthy variable values):
Note: Thread 3 is looping a bit farther down:
Is there any additional information that I could collect that would be helpful to diagnose the issue?