NOX tests timing out in ATDM configuration RHEL6 builds #2628
Comments
@bartlettroscoe, I just created this issue and thought you would like to know.
The following queries provide some more information: What does the totality of those queries seem to indicate? What trends do you see?
The specific builds that seem to be affected are:
After looking more closely, a few more observations:
These tests all use Epetra (and not Tpetra). Two questions:
If you look at the three failures on
The other test hit a segfault late into the test run:
These same failures did not occur for the identical build on 'white' that day. Given all of this, I think the failures on 'ride' for the build
@etphipp said
Yes, I think that is what is happening, as described in #2422, and I suspect that the machine is being overloaded. Looking at the build configuration, it seems this is running with more parallelism than the job weight accounts for; the Jenkins job shows it with a job weight of only 20. Does this mean that we are overloading the machine by 12 cores? How many cores do 'hansel' and 'gretel' actually have? Can you please create a PR that reduces this? If we can get this in by tonight, then we can see what happens tomorrow morning.
@trilinos/nox, should we disable Epetra support in NOX for ATDM? Do the ATDM APPs need Epetra support in NOX? Should we set
They can use Epetra and NOX, so I wouldn’t do that quite yet.
I am implementing the NOX solvers in ATDM, and I have been developing them to support Epetra and Tpetra.
I don't think we can actually disable Epetra in Panzer. That will take some work.
@etphipp wrote:
I think Epetra uses OpenMP for sparse matrix-vector multiplies and axpys by default, as long as OpenMP is enabled in the build.
There is an Epetra CMake option to disable use of OpenMP.
It does not look like Epetra has a way to disable OpenMP just for Epetra as shown at:
It is just hard-coded to use OpenMP whenever OpenMP is enabled in the build. We could add a proper Epetra-specific option for this.
I rebuilt the same version as yesterday and ran the tests on
Test result summary:
@fryeguy52, what happens when you run just:
for example? Does that one test time out when run all by itself? What about the other tests?
Run individually, they all pass:
If I just run
Running all tests with
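(The exact commands were elided above; the following is only a hedged sketch of the kind of invocations being discussed. The test name comes from the list in the Description below, OMP_NUM_THREADS=2 comes from the later discussion, and the -j level is an assumption.)

```
# Run one of the timing-out tests by itself (reported to pass in a few seconds):
env OMP_NUM_THREADS=2 ctest -R NOX_LOCA_BrusselatorHopf_MPI_2

# Run many tests in parallel, which is where the timeouts show up:
env OMP_NUM_THREADS=2 ctest -j16
```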
@bartlettroscoe wrote:
It's worth trying, at least to see if that fixes the tests.
Wow, that is unfortunate. Even with all the other threads bound to the same core, I don't see how a test that takes 4.8s to complete can time out at 10 minutes! Let's try disabling OpenMP in Epetra and see what happens. @fryeguy52, I will create a branch of Trilinos that disables OpenMP for our ATDM builds and then point you to it to test.
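(For reference, a minimal sketch of what such a change could amount to at configure time, assuming the standard project-wide Trilinos_ENABLE_OpenMP option is the knob used; this is not an excerpt from the actual branch.)

```
# Hypothetical reconfigure with OpenMP disabled for the whole Trilinos build:
cmake -D Trilinos_ENABLE_OpenMP=OFF <path-to-Trilinos-source>
```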
I did an ATDM build on my (nearly RHEL6) machine and was able to reproduce the behavior, and I think I see what the issue is and how to resolve it without disabling OpenMP. By default, mpirun from OpenMPI binds processes to cores if the number of processes is <= 2. When you run ctest -jX with X at least 4, multiple 2-MPI-rank tests run simultaneously and, from watching the process monitor on my machine, they appear to get bound to the same two cores. Each MPI rank is then running 2 threads, so you end up with 4 threads trying to run on the same core simultaneously. A simple solution is to not have mpirun bind the MPI ranks to cores, by adding the configure option:
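(The exact option was not captured in this thread. The sketch below assumes the TriBITS MPI_EXEC_PRE_NUMPROCS_FLAGS cache variable is where extra mpirun flags go; --bind-to none itself is a standard OpenMPI option.)

```
# Hypothetical configure fragment: tell OpenMPI's mpirun not to bind ranks to cores.
cmake \
  -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none" \
  <other-configure-options> <path-to-Trilinos-source>
```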
This will free the OS to move the processes around to balance the machine. Normally this is a bad thing to do, because each MPI rank will then try to run OpenMP threads across the whole machine, but since you are setting OMP_NUM_THREADS=2, each MPI rank will only run 2 threads. This fixed the timeouts on my machine, and also made the whole test suite run a lot faster (it took about 10s to run all 105 tests using "ctest -j32" on my 32-core machine).
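(If you want to see the binding behavior directly, OpenMPI's mpirun can print the bindings it applies. --report-bindings and --bind-to are standard OpenMPI options; the executable name below is only a placeholder, not an actual NOX test binary.)

```
mpirun --report-bindings -np 2 ./a_nox_test.exe   # show which cores each rank gets bound to
mpirun --bind-to none    -np 2 ./a_nox_test.exe   # same run with core binding disabled
```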
This solved the timeouts for the NOX test suite for me on ceerws1113.
@etphipp, thanks! That did the trick for me too. PR #2638 applies this fix, and I verified that it fixes the NOX tests. Can you go ahead and approve the PR? Is this an initial solution to the OpenMP binding problem described in #2422, to allow us to enable OpenMP in an auto PR build?
Great! I approved the PR.
I suppose it's an initial solution, but I don't see it being an ideal one. Ideally, ctest would correctly bind MPI ranks to separate cores when launching multiple tests simultaneously, and respect the number of threads each test is using.
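(For what it's worth, CTest does already have a per-test PROCESSORS property that makes ctest -j reserve multiple slots for a multi-rank, multi-threaded test; it prevents oversubscription in scheduling, though it does not by itself bind ranks to distinct cores. A hedged CMake sketch, using one of the test names from the Description:)

```
# Reserve 4 slots (2 MPI ranks x 2 OpenMP threads) for this test so that
# "ctest -jN" does not schedule too many such tests at the same time.
set_tests_properties(NOX_LOCA_BrusselatorHopf_MPI_2 PROPERTIES PROCESSORS 4)
```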
Don't bind MPI ranks to cores, to solve OpenMP timeouts (#2628)
After the merge of PR #2638, there are no more timeouts at all in these ATDM Trilinos RHEL6 builds. For example, you can see all the newly passing tests for the build on CDash. Now there is just one failing Teko test, and then all of these builds can be promoted to the "ATDM" CDash Track/Group. Closing this issue as complete!
CC: @trilinos/nox
Next Action Status
Merged PR #2638 fixed all of the timing-out tests in the RHEL6 builds starting 4/26/2018.
Description
There are 4 NOX tests that are failing due to timeout in the ATDM configuration when building on RHEL6 with the SEMS environment. They have been failing this way since 4/18/18. The tests are:
NOX_LOCA_BrusselatorHopf_MPI_2
NOX_LOCA_BrussXYZT_BlockDiagonal_MPI_2
NOX_LOCA_TcubedTP_MPI_2
NOX_LOCA_TcubedTP_stratimikos_MPI_2
3 others are failing due to timeout intermittently, about every other day:
NOX_LOCA_BrussXYZT_Sequential_MPI_2
NOX_LOCA_BrussXYZT_SequentialOPS_MPI_2
NOX_LOCA_BrussXYZT_SequentialIPS_MPI_2
All of these timeout failures are in builds using the SEMS environment; they are most common in GNU builds but do happen with Intel builds as well. All of them use OpenMP.
Steps to Reproduce
Possibly helpful CDash links:
GNU opt OpenMP build failures