NOX tests timing out in ATDM configuration RHEL6 builds #2628

Closed
fryeguy52 opened this issue Apr 24, 2018 · 22 comments

@fryeguy52
Contributor

fryeguy52 commented Apr 24, 2018

CC: @trilinos/nox

Next Action Status

Merged PR #2638 fixed all of the timing-out tests in the RHEL6 builds starting 4/26/2018.

Description

There are 4 NOX tests that are failing due to timeout in the ATDM configuration when building on RHEL6 with the SEMS environment. They have been failing in this way since 4/18/18. The tests are:

NOX_LOCA_BrusselatorHopf_MPI_2
NOX_LOCA_BrussXYZT_BlockDiagonal_MPI_2
NOX_LOCA_TcubedTP_MPI_2
NOX_LOCA_TcubedTP_stratimikos_MPI_2

Three others are failing due to timeout intermittently, about every other day:
NOX_LOCA_BrussXYZT_Sequential_MPI_2
NOX_LOCA_BrussXYZT_SequentialOPS_MPI_2
NOX_LOCA_BrussXYZT_SequentialIPS_MPI_2

All of these timeout failures occur in the SEMS environment; they are most common in the GNU builds but do happen in the Intel builds as well. All of them use OpenMP.

Steps to Reproduce

  • Go to a RHEL6 machine that has the SEMS environment (CEE machines have the SEMS modules mounted).
  • Clone Trilinos and then run the following:
$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-opt-openmp

$ cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_NOX=ON \
  $TRILINOS_DIR

$ make -j16

$ ctest -j16 
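
To focus on just the failing tests after the build completes, a filtered run can be used (a minimal sketch; the -R regular expression below is mine, not from the original report, and simply matches the test names listed above):

# Run only the NOX LOCA tests reported as timing out
$ ctest -j16 -R 'NOX_LOCA_(TcubedTP|BrusselatorHopf|BrussXYZT)'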

Possibly helpful CDash links:

GNU opt openmp build failures

@fryeguy52 added the type: bug, pkg: NOX, and client: ATDM labels Apr 24, 2018
@fryeguy52
Contributor Author

@bartlettroscoe, I just created this issue and thought you would like to know.

@bartlettroscoe
Member

The following queries provide some more information:

@fryeguy52,

What does the totality of those queries seem to indicate? What trends do you see?

@fryeguy52
Contributor Author

The specific builds that seem to be affected are:

  • Trilinos-atdm-rhel6-gnu-debug-openmp
  • Trilinos-atdm-rhel6-gnu-opt-openmp
  • Trilinos-atdm-white-ride-cuda-debug-all-at-once

After looking more closely, a few more observations:

  • these tests usually run in less than 1 minute in other builds, but in these builds they take 9+ minutes to complete even when they pass
  • all of the NOX_LOCA tests that are timing out in our configuration are using OpenMP
  • these failures are not showing up on either the shiller/hansen or the white/ride builds

@etphipp
Contributor

etphipp commented Apr 24, 2018

These tests all use Epetra (and not Tpetra). Two questions:

  1. Is Epetra actually configured to use OpenMP?
  2. If so, are the threads being bound properly so that the machine is not being oversubscribed (where each MPI rank is trying to use all of the threads on the machine)? See the check sketched below.
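
One way to look at the second question (a sketch assuming Open MPI's mpirun, which supports --report-bindings; the executable name below is illustrative) is to launch one of the affected tests by hand and compare the reported bindings of tests that run at the same time:

# Prints the core binding of each MPI rank; identical bindings across
# concurrently running tests would indicate oversubscription.
$ export OMP_NUM_THREADS=2
$ mpirun --report-bindings -np 2 ./NOX_LOCA_TcubedTP.exe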

@bartlettroscoe
Member

@fryeguy52,

If you look at the three failures on Trilinos-atdm-white-ride-cuda-debug-all-at-once, they are failures, not timeouts, and they are all on the same day and build at 2018-04-12T06:40:45. Two of those tests showed the error:

--------------------------------------------------------------------------
While computing bindings, we found no available cpus on
the following node:

  Node:  ride7

Please check your allocation.
--------------------------------------------------------------------------

The other test hit a segfault late into running the test:

================================================================
Anasazi Eigensolver starting with block size 1


       Condition number estimate of preconditioner is 1.431e+02
[ride14:28030] *** Process received signal ***
[ride14:28030] Signal: Segmentation fault (11)
[ride14:28030] Signal code: Address not mapped (1)
[ride14:28030] Failing at address: 0x10038a00020
[ride14:28030] [ 0] [0x100000050478]
[ride14:28030] [ 1] [0x3ff0000000000000]
[ride14:28030] *** End of error message ***
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 28030 on node ride14 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

These same failures did not occur for the identical build on 'white' that day.

Given all of this, I think the failures on 'ride' for the build Trilinos-atdm-white-ride-cuda-debug-all-at-once are a problem with that node on 'ride' where these ran.

@bartlettroscoe
Member

@etphipp said

If so, are the threads being bound properly so that the machine is not being oversubscribed (where each MPI rank is trying to use all of the threads on the machine)?

Yes, I think that is what is happening, as described in #2422, and I suspect that the machine is being overloaded.

Looking at:

it seems this is running with OMP_NUM_THREADS=2 and ctest -j16. This should consume 32 total cores.

The Jenkins job:

shows it with a job weight of only 20. Does this mean that we are overloading the machine by 12 cores? How many cores do 'hansel' and 'gretel' actually have?
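
For reference, a quick way to check the core count on one of those nodes and redo that arithmetic (a sketch; the 16 x 2 figure follows the comment above):

$ nproc                              # logical cores available on the node
$ lscpu | grep -E 'Socket|Core|Thread'
# threads requested ~= (ctest -j slots) x OMP_NUM_THREADS = 16 x 2 = 32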

@fryeguy52,

Can you please create a PR that reduces ATDM_CONFIG_CTEST_PARALLEL_LEVEL=16 to ATDM_CONFIG_CTEST_PARALLEL_LEVEL=4 for the OpenMP RHEL6 builds and see what that does? Please leave the Jenkins 'Job weight' at 20 so that no other process tries to run while this build is running.

If we can get this in by tonight, then we can see what happens tomorrow morning.

@bartlettroscoe
Member

@trilinos/nox,

Should we disable Epetra support in NOX for ATDM? Do the ATDM APPs need Epetra support in NOX? Should we set NOX_ENABLE_Epetra=OFF?

@etphipp
Contributor

etphipp commented Apr 24, 2018

They can use Epetra and NOX, so I wouldn’t do that quite yet.

@wfspotz
Contributor

wfspotz commented Apr 24, 2018

I am implementing the NOX solvers in ATDM, and I have been developing them to support both Epetra and Tpetra.

@rppawlo
Contributor

rppawlo commented Apr 25, 2018

I don't think we can actually disable Epetra in Panzer. That will take some work.

@mhoemmen
Contributor

@etphipp wrote:

Is Epetra actually configured to use OpenMP?

I think Epetra uses OpenMP for sparse matrix-vector multiplies and axpys by default, as long as OpenMP is enabled in the build.

@mhoemmen
Contributor

There is an Epetra CMake option to disable use of OpenMP.

@bartlettroscoe
Member

There is an Epetra CMake option to disable use of OpenMP.

It does not look like Epetra has a way to disable OpenMP just for Epetra as shown at:

It is just hard-coded to use ${PROJECT_NAME}_ENABLE_OpenMP.

We could add a proper Epetra_ENABLE_OpenMP option to allow us to turn off OpenMP for Epetra. Should we do that?
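
If such an option were added, the ATDM configure from the reproduction steps could then turn OpenMP off for Epetra alone. A hypothetical sketch (note: the -DEpetra_ENABLE_OpenMP=OFF option does not exist yet, per the comment above):

$ cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_NOX=ON \
  -DEpetra_ENABLE_OpenMP=OFF \
  $TRILINOS_DIR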

@fryeguy52
Contributor Author

I rebuilt the same version as yesterday and ran the tests on ceerws1113 (a RHEL6 CEE machine) with ATDM_CONFIG_CTEST_PARALLEL_LEVEL=4. The same tests timed out as yesterday. This is what I did:

source $TRILINOS_DIR/cmake/std/atdm/load-env.sh gnu-opt-openmp 
cmake -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_NOX=ON $TRILINOS_DIR

test result summary:

95% tests passed, 5 tests failed out of 105

Label Time Summary:
NOX    = 7156.33 sec (105 tests)

Total Test time (real) = 1047.22 sec

The following tests FAILED:
	 71 - NOX_LOCA_TcubedTP_MPI_2 (Timeout)
	 72 - NOX_LOCA_TcubedTP_stratimikos_MPI_2 (Timeout)
	 77 - NOX_LOCA_BrusselatorHopf_MPI_2 (Timeout)
	 80 - NOX_LOCA_BrussXYZT_Sequential_MPI_2 (Timeout)
	 84 - NOX_LOCA_BrussXYZT_BlockDiagonal_MPI_2 (Timeout)
Errors while running CTest

@bartlettroscoe
Member

@fryeguy52, what happens when you run just:

$ ctest -R NOX_LOCA_TcubedTP_stratimikos_MPI_2 

for example? Does that one test run all by itself timeout? What about the other tests?

@fryeguy52
Contributor Author

fryeguy52 commented Apr 25, 2018

Run individually, they all pass:

1/1 Test #72: NOX_LOCA_TcubedTP_stratimikos_MPI_2 ...   Passed    6.40 sec

1/1 Test #71: NOX_LOCA_TcubedTP_MPI_2 ..........   Passed    4.15 sec

1/1 Test #77: NOX_LOCA_BrusselatorHopf_MPI_2 ...   Passed    1.04 sec

1/1 Test #80: NOX_LOCA_BrussXYZT_Sequential_MPI_2 ...   Passed    1.12 sec

1/1 Test #84: NOX_LOCA_BrussXYZT_BlockDiagonal_MPI_2 ...   Passed    4.83 sec

If I just run ctest with no -j, then everything passes:

100% tests passed, 0 tests failed out of 105

Label Time Summary:
NOX    =  72.56 sec (105 tests)

Running all tests with ctest -j4:

Test  #84: NOX_LOCA_BrussXYZT_BlockDiagonal_MPI_2 ......................***Timeout 600.06 sec

@mhoemmen
Contributor

@bartlettroscoe wrote:

We could add a proper Epetra_ENABLE_OpenMP option to allow us to turn off OpenMP for Epetra. Should we do that?

It's worth trying, at least to see if that fixes the tests.

@bartlettroscoe
Member

if I just run ctest with no -j then everything passes
...
running all tests with ctest -j4:
[timeouts]

Wow, that is unfortunate. Even with all of the threads bound to the same core, I don't see how a test that takes 4.8 s to complete can time out at 10 minutes!

Let's try disabling OpenMP in Epetra and see what happens.

@fryeguy52, I will create a branch of Trilinos that disables OpenMP for our ATDM builds and then point you to it to test.

@etphipp
Contributor

etphipp commented Apr 25, 2018

I did an ATDM build on my (nearly RHEL6) machine and was able to reproduce the behavior, and I think I see what the issue is and how to resolve it without disabling OpenMP.

By default, mpirun from OpenMPI binds processes to cores if the number of processes is <= 2. When you run ctest -jX with X at least 4, multiple 2-MPI-rank tests run simultaneously and, from watching the process monitor on my machine, they appear to be bound to the same two cores. Each MPI rank is then running 2 threads, so you end up with 4 threads trying to run on the same core simultaneously.

A simple solution is to not have mpirun bind the MPI ranks to cores by adding the configure option:

-D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none"

This frees the OS to move the processes around to balance the machine. Normally this is a bad thing to do, because each MPI rank will then try to run OpenMP threads across the whole machine, but since you are setting OMP_NUM_THREADS=2, each MPI rank will only run 2 threads. This fixed the timeouts on my machine, and also made the whole test suite run a lot faster (it took about 10 s to run all 105 tests using "ctest -j32" on my 32-core machine).
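
For concreteness, the configure from the reproduction steps above with this flag added would look roughly like this (a sketch; only the MPI_EXEC_PRE_NUMPROCS_FLAGS line is new relative to the original steps):

$ cmake \
  -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
  -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_NOX=ON \
  -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none" \
  $TRILINOS_DIR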

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue Apr 25, 2018
This solved the timeouts for the NOX test suite for me on ceerws1113.
@bartlettroscoe
Member

A simple solution is to not have mpirun bind the MPI ranks to cores by adding the configure option:

@etphipp, Thanks! That did the trick for me too. The PR #2638 applies this fix and I verified that it fixed the NOX tests. Can you go ahead and approve the PR?

Is this an initial solution to the OpenMP binding problem described in #2422, to allow us to enable OpenMP in an auto PR build?

@etphipp
Contributor

etphipp commented Apr 25, 2018

Great! I approved the PR.

Is this an initial solution to the OpenMP binding problem described in #2422, to allow us to enable OpenMP in an auto PR build?

I suppose it's an initial solution, but I don't see it being an ideal one. Ideally, ctest would correctly bind MPI ranks to separate cores when launching multiple tests simultaneously, and respect the number of threads each test is using.

@bartlettroscoe
Member

After the merge of PR #2638, there are no more timeouts at all in these ATDM Trilinos RHEL6 builds. For example, you can see all of the newly passing tests for the build Trilinos-atdm-rhel6-gnu-debug-openmp today at:

See the "+7" superscript on top of the 105 passing tests for NOX?

Now there is just one failing Teko test and then all of these builds can be promoted to the "ATDM" CDash Track/Group.

Closing this issue as complete!

@bartlettroscoe added the PA: Nonlinear Solvers label Nov 30, 2018