CI/Jenkins failures: timeouts #4686
I am unable to reproduce this issue manually. Any suggestions on how to replicate it would be appreciated.
Are you pinning to cores? Their setup is interesting; they taskset to 2 cores for mpirun, which I think means the MPI processes will inherit that, and then spawn 4 threads each on the 2 processes.
Yes, this pinning introduces resource contention and once helped us identify an issue in code that went in unnoticed (#1813).
Good call -- no, I wasn't. Didn't help, though:
This is an optimized build of master head, obviously using vader on a single node. Let me try a more recent machine (although it's got a slightly slower clock speed)... Nope, it still runs fine there, too:
@artpol84 Just out of curiosity, since you're binding to Linux virtual IDs 0,1, what's the load on the machine at the time? I.e., can you change your script to output the load before running each of these tests? Also, is any other Jenkins test also binding to LVIDs 0,1? I.e., are we just banging the hell out of LVIDs 0 and 1 via the union of all currently-running Jenkins tests, and therefore they're just running incredibly slowly? As you noted, testing for contention is good, but are we testing for too much contention sometimes? Although I do notice that in the Mellanox output, we see the stdout from the entire
With openib:
And with TCP:
@jsquyres When I was reproducing manually, I saw that no process was running on the node at the time.
I mean, no CPU-intensive processes were there except mine.
I just noticed that the MPI_Send and MPI_Recv tags are different. It reminds me of the issue I cited here previously.
And note that I wasn't binding at all, so this is not the root cause!
I checked the test. The tag is incremented on each iteration by both ranks. It looks like the receiver moves on to the next iteration while the sender is still waiting for the previous message to complete.
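For illustration, here is a minimal sketch of the tag-per-iteration pattern being described. This is an assumption-laden reconstruction, not the actual test source: the iteration count, message size, and direction are made up. It is only meant to show how a blocked sender and a receiver that has moved on can end up reporting different tags.

```c
/* Minimal illustration (assumed names/sizes, not the real test) of a loop
 * where both ranks bump the message tag every iteration. */
#include <mpi.h>

#define ITERS 1000
#define COUNT 4096

int main(int argc, char **argv)
{
    int rank;
    int buf[COUNT];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int tag = 0; tag < ITERS; ++tag) {   /* tag incremented each iteration */
        if (rank == 0) {
            /* Larger message: the send may not complete until the transport
             * finishes the transfer and returns its completion. */
            MPI_Send(buf, COUNT, MPI_INT, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* In the hang described above, rank 1 has already moved on to the
             * receive for the next tag while rank 0 is still blocked in
             * MPI_Send for the previous tag. */
            MPI_Recv(buf, COUNT, MPI_INT, 0, tag, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}
```

In a healthy run both ranks stay in lock-step on the same tag; stacks showing MPI_Send on tag i and MPI_Recv on tag i+1 would be consistent with the sender's completion being stuck in the transport.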
@vspetrov @jladd-mlnx It seems that this issue is also related to the hcoll issue we observed here. If I disable hcoll, I can't reproduce on bgate anymore. One thing concerns me: why, in other branches (like #4685, for which I provided a Jenkins link), does the test finish successfully and hang at MPI_Finalize, while for master we see the issue much earlier, during an unrelated point-to-point phase?
@jsquyres I've fixed the timeout as we discussed on the call:
We were automatically detecting availability of |
I'm going to add a similar fix to the oshmem code to address http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7044/console.
@artpol84 Excellent -- that makes the Mellanox Jenkins output much easier to parse/understand. Thanks! Just to be clear: are you saying that there's still a vader issue on master? Or is this also related to hcoll? (I ask because in build 7042 -- linked above -- it's hanging in an hcoll-enabled run)
@artpol84 In #4683 (comment), you cited leftover files in /dev/shm.
Regarding the vader leftover: yes, I see these files again in /dev/shm.
However, the issue that the leftover was causing looked like this:
while the errors in 7042 and 7047 occur later, when Send/Recv is performed.
And it seems that after 7047 failed, it led to subsequent failures because of the leftover:
7054 finished OK (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7054/) after removing the vader leftover.
So is the real problem the vader leftover in /dev/shm?
No, it seems like there are 2 real problems.
FWIW: there is a cleanup mechanism in master to resolve the leftover problem. However, when you kill mpirun by hitting it with SIGKILL, you don't give it a chance to invoke the cleanup code. My best guess is that is what caused your second case. I'll check to ensure that the mpirun timeout option does invoke the cleanup - pretty sure it does, but worth a check.
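For context on why SIGKILL matters here: cleanup hooks only run on exit paths the process actually gets to execute, and SIGKILL can be neither caught nor handled, so any handler-based cleanup is skipped entirely. Below is a minimal sketch of that failure mode; the segment name and handlers are hypothetical, not Open MPI's actual cleanup code.

```c
/* Sketch: remove a /dev/shm backing file on normal exit or SIGTERM.
 * SIGKILL bypasses both paths, which leaves the file behind. */
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SEG_NAME "/example_segment"   /* hypothetical name */

static void cleanup(void)
{
    shm_unlink(SEG_NAME);             /* deletes /dev/shm/example_segment */
}

static void on_term(int sig)
{
    (void)sig;
    cleanup();
    _exit(1);
}

int main(void)
{
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    if (fd >= 0) {
        ftruncate(fd, 4096);          /* give the segment some backing space */
    }

    atexit(cleanup);                  /* normal exit path */
    signal(SIGTERM, on_term);         /* catchable termination */

    /* ... do work; if the process is hit with SIGKILL (kill -9), neither
     * handler runs and the /dev/shm file is left over ... */
    sleep(60);
    return 0;
}
```

The point is only that an external kill -9 (or a hard-killed Jenkins job) leaves the files behind no matter what cleanup hooks exist; a cooperative path such as mpirun's timeout handling can still run them. (Link with -lrt on older glibc for shm_open/shm_unlink.)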
According to the output of http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7047/console, the hang was interrupted by the mpirun timeout:
And if you check the list of tested PRs (http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/) between 7047 and 7053 (both failed), you will see that the successful PRs in between were targeting non-master branches. So it could be a problem with this cleanup feature.
so far as I can determine, yes |
@hjelmn One thing that concerns me: since the leftover was causing issues for sequentially launched applications, it could also cause issues if multiple MPI programs run on the same node.
The file names do not contain any identifying information, and this may cause problems when a node is used by multiple users. It makes sense to add some IDs (like the jobid or PID) to the filename to protect against that.
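Here is a sketch of the kind of naming the comment suggests; the format, the open_segment helper, and the jobid parameter are hypothetical, not what vader actually uses.

```c
/* Sketch: embed uid, jobid, and PID in the backing-file name so leftovers
 * are attributable and cannot collide across concurrent jobs or users. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* jobid would come from the runtime/launcher; hypothetical parameter here */
int open_segment(unsigned int jobid)
{
    char name[128];
    snprintf(name, sizeof(name), "/segment.%u.%u.%d",
             (unsigned int)getuid(), jobid, (int)getpid());
    /* O_EXCL ensures we never silently reuse another job's leftover file */
    return shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
}
```

With identifiers like these in the name, a CI cleanup step could also remove only its own job's files instead of wiping /dev/shm wholesale.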
Just checked on another system and the timeout on master still isn't removing the /dev/shm files. Will play with it more as time permits.
@artpol84 @jladd-mlnx Per discussion on the 2017-01-16 webex, Mellanox agreed to do a manual build on the Jenkins machine (since it seems to be the only machine that can replicate this problem).
Which issue out of the 2 are we tracking?
Improper leftovers should now be handled (via #4701). The only issue left should be the vader hang.
Vader hangs should now be fixed via #4767.
A bunch of CI tests have been failing recently, particularly in the Mellanox Jenkins.
Per discussion on the 2017-01-09 teleconf, we were reminded that most of the segv's that we see in the Mellanox Jenkins (e.g., #4683 Mellanox Jenkins master build http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/7030/) are actually timeouts.
Meaning: it looks like some threaded and/or vader-based tests are timing out. But not 100% of the time. @artpol84 confirmed that he can reproduce if he runs the tests multiple times (i.e., sometimes the test passes, sometimes it fails). Given that the failures have typically involved vader and/or multi-threaded tests, @rhc54 points out 8b8aae3, which was a recent ASM commit.
Investigation is required. @jsquyres volunteered to try to reproduce as well.