
Define better strategy for managing threaded testing with Trilinos #2422

Open
bartlettroscoe opened this issue Mar 20, 2018 · 39 comments
Labels
  • ATDM DevOps - Issues that will be worked by the Coordinated ATDM DevOps teams
  • client: ATDM - Any issue primarily impacting the ATDM project
  • DO_NOT_AUTOCLOSE - This issue should be exempt from auto-closing by the GitHub Actions bot
  • PA: Framework - Issues that fall under the Trilinos Framework Product Area
  • pkg: Kokkos
  • pkg: Tpetra
  • type: enhancement - Issue is an enhancement, not a bug

Comments

@bartlettroscoe
Member

bartlettroscoe commented Mar 20, 2018

CC:: @trilinos/framework, @trilinos/kokkos, @trilinos/tpetra, @nmhamster, @rppawlo

Next Action Status

Set up a meeting to discuss current status of threaded testing in Trilinos and some steps to try to address the issues ...

Description

It seems the current strategy for threaded testing in Trilinos is to build threaded code, give every threaded test the same number of threads (e.g., by setting export OMP_NUM_THREADS=2 when using OpenMP), and then run the test suite with ctest -j<N> using that fixed thread count. But this approach, and testing with threads enabled in general, has several issues.
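For concreteness, the current strategy boils down to something like the following sketch of a ctest -S driver script (the script name and paths are hypothetical, not an actual Trilinos script; it assumes the source directory is a Trilinos clone so that CTestConfig.cmake is found):

```cmake
# Hypothetical ctest -S driver illustrating the current fixed-thread-count strategy.
set(CTEST_SOURCE_DIRECTORY "$ENV{HOME}/Trilinos")        # assumed checkout location
set(CTEST_BINARY_DIRECTORY "$ENV{HOME}/TRILINOS_BUILD")  # assumed build directory
set(ENV{OMP_NUM_THREADS} "2")   # one fixed thread count for every threaded test
ctest_start(Experimental)
ctest_test(PARALLEL_LEVEL 16)   # the same as running 'ctest -j16'
```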

First, with some configurations and systems, running with any <N> in ctest -j<N> will result in all of the test executables binding to the same threads on the same cores, making things run very slowly, as described in #2398 (comment). A similar problem also occurs with CUDA builds, where the test processes running concurrently do not spread the load across the available GPUs (see #2446 (comment)).

Second, even when one does not experience the above problem of binding to the same cores (which is not always a problem), this approach does not make very good usage of the test machine, because it assumes that every MPI process is multi-threaded with Kokkos, which is not true. Even when OMP_NUM_THREADS > 1, there are a lot of Trilinos tests that don't have any threaded code, so even if ctest allocates room for 2 threads per MPI process, only one thread will be used. This results in many cores not being kept busy running code and therefore in tests taking longer to complete.

The impact of the two problems above is that many developers and many automated builds have to run with a small ctest -j<N> (e.g., ctest -j8 is used in many of the ATDM Trilinos builds) and therefore do not utilize many of the cores that are available. This makes the time to run the full test suite go up significantly. That negatively impacts developer productivity (because developers have to wait longer to get feedback from running tests locally) and wastes existing testing hardware and/or limits the number of builds and tests that can be run in a given testing day (which reduces the number of defects that we can catch and therefore costs Trilinos developers and users time and money).

Third, having to run the entire Trilinos test suite with a fixed number of threads, such as export OMP_NUM_THREADS=2 or export OMP_NUM_THREADS=4, does not result in very good testing, or it results in very expensive testing if the entire test suite has to be run multiple times with different thread counts. It has been observed that some defects only show up at particular thread counts, such as export OMP_NUM_THREADS=5, for example. This would be like running every MPI test in Trilinos with exactly the same number of MPI processes, which would not result in very good testing (and is not what is done in Trilinos, where different tests are run with different numbers of MPI processes).

Ideas for Possible Solutions

First, ctest needs to be extended to inform it of the architecture of the system where it will be running tests. CTest needs to know the number of sockets per node, the number of cores per socket, the number of threads per node, and the number of nodes. We will also need to inform CTest about the number of MPI ranks vs. threads per MPI rank for each test (i.e., add a THREADS_PER_PROCESS property in addition to the existing PROCESSORS property). With that type of information, ctest should be able to determine the binding of the different ranks of an MPI job that runs a test to specific cores on sockets on nodes. And we will need to find a way to communicate this information to the MPI jobs when they are run by ctest. I think this means adding the types of process affinity and process placement that you see in modern MPI implementations (see https://github.com/open-mpi/ompi/wiki/ProcessAffinity and https://github.com/open-mpi/ompi/wiki/ProcessPlacement). See this Kitware backlog item.
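For reference, below is a minimal sketch (with a hypothetical test name, not actual Trilinos code) of the per-test CTest properties that exist today and that this idea would build on: PROCESSORS tells ctest -j<N> how many slots a test consumes, and PROCESSOR_AFFINITY (new in CMake 3.12) asks ctest to bind the test to that many processors. The THREADS_PER_PROCESS property proposed above does not exist yet.

```cmake
# Sketch only: a hypothetical test using 4 MPI ranks x 2 OpenMP threads,
# so it consumes 8 processor slots as far as 'ctest -j<N>' is concerned.
add_test(NAME MyPkg_MyThreadedTest_MPI_4
  COMMAND mpiexec -np 4 $<TARGET_FILE:MyPkg_MyThreadedTest>)
set_tests_properties(MyPkg_MyThreadedTest_MPI_4 PROPERTIES
  PROCESSORS 8                      # MPI ranks * threads per rank
  PROCESSOR_AFFINITY ON             # new in CMake 3.12
  ENVIRONMENT "OMP_NUM_THREADS=2")
```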

Second, we should investigate how to add a NUM_THREADS_PER_PROC <numThreadsPerProc> argument to the TRIBITS_ADD_TEST() and TRIBITS_ADD_ADVANCED_TEST() commands. It would be good if this could be added directly to these TriBITS functions, with some type of "plugin" system that allows us to define how the number of threads gets set when running each individual test. The default TriBITS implementation could simply compute NUM_TOTAL_CORES_USED <numTotalCoresUsed> as <numThreadsPerProc> * <numMpiProcs>.
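For illustration, here is how the proposed argument might look in a package's CMakeLists.txt file. NUM_THREADS_PER_PROC is hypothetical (it is what this idea proposes), and the test name is made up; NUM_MPI_PROCS and STANDARD_PASS_OUTPUT are existing TRIBITS_ADD_TEST() arguments, and NUM_TOTAL_CORES_USED is the value the default implementation would derive.

```cmake
TRIBITS_ADD_TEST(
  MyThreadedUnitTests        # hypothetical test executable
  NUM_MPI_PROCS 4
  NUM_THREADS_PER_PROC 2     # proposed: would imply NUM_TOTAL_CORES_USED 8
  STANDARD_PASS_OUTPUT
  )
```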

The specialization of this new TriBITS functionality for Kokkos/Trilinos would set the number of requested threads based on the threading model enabled and known at configure time. For OpenMP, it would set the env var OMP_NUM_THREADS=<numThreads>, and for other threading models it would pass in --kokkos-threads=<numThreads>. If the NUM_THREADS_PER_PROC <numThreadsPerProc> argument was missing, this could use a default number of threads (e.g., a global configure argument Trilinos_DEFAULT_NUM_THREADS with default 1). If the computed <numTotalCoresUsed> was larger than ${MPI_EXEC_MAX_NUMPROCS} (which should be set to the maximum number of threads that can be run on that machine when using threading), then the test would get excluded and a message would be printed to STDOUT. Such a CMake function should be pretty easy to write if you know the threading model used by Kokkos at configure time.
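A rough sketch of what such a helper might look like for an OpenMP build (the function name is made up and this is not the actual TriBITS implementation): it computes the total core count, disables the test if that exceeds ${MPI_EXEC_MAX_NUMPROCS}, and otherwise sets the PROCESSORS property and OMP_NUM_THREADS for a test that has already been added.

```cmake
# Hypothetical helper, not actual TriBITS code.
function(my_set_threaded_test_props testName numMpiProcs numThreadsPerProc)
  math(EXPR totalCores "${numMpiProcs} * ${numThreadsPerProc}")
  if(totalCores GREATER "${MPI_EXEC_MAX_NUMPROCS}")
    message(STATUS "Excluding test ${testName}: needs ${totalCores} cores"
      " > MPI_EXEC_MAX_NUMPROCS=${MPI_EXEC_MAX_NUMPROCS}")
    set_tests_properties(${testName} PROPERTIES DISABLED TRUE)
  else()
    set_tests_properties(${testName} PROPERTIES
      PROCESSORS ${totalCores}
      ENVIRONMENT "OMP_NUM_THREADS=${numThreadsPerProc}")
  endif()
endfunction()
```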

Definition of Done:

  • ???

Tasks:

  1. ???
@bartlettroscoe added the type: enhancement, pkg: Kokkos, pkg: Tpetra, and client: ATDM labels on Mar 20, 2018
@bartlettroscoe
Member Author

bartlettroscoe commented Mar 20, 2018

The creation of this issue came out of a conversation I had with @nmhamster last week about the issues with threaded testing of Trilinos and its impact on ATDM. The issue of testing the new ctest property PROCESSOR_AFFINITY is fairly urgent because it is in the current CMake git repo 'master' branch, which means it will go out in CMake 3.12 as-is, and if we don't fix any problems with it now, it may be hard to change after that. Also, if we are going to enable OpenMP in the CI or the auto PR build, we need to make sure we can run tests in parallel so that we are not stuck running with ctest -j1 due to the thread binding issue mentioned above and described by @crtrott in #2398 (comment). So we need to get on this testing ASAP.

@bartlettroscoe
Member Author

Below is some detailed info from @rppawlo about how to reproduce the binding of threads in multiple MPI processes to the same core.

@rppawlo,

Can you attach your complete do-configure script for this build? Otherwise, hopefully this is as simple as using the standard SEMS CI build with:

$ cmake \
  [standard CI options] \
  -D Trilinos_ENABLE_OpenMP=ON \
  -D MPI_EXEC_POST_NUMPROCS_FLAGS="-bind-to;core;-map-by;core" \
  -D Trilinos_ENABLE_Panzer=ON -DTrilinos_ENABLE_TESTS=ON \ 
  <trilinosDir>

$ make -j16

$ export OMP_NUM_THREADS=2

$ ctest -E ConvTest [-j16]

Below are the Panzer test results for the corresponding configure file that I sent earlier. I ran with -E ConvTest to turn off the costly convergence tests. HWLOC was enabled for an OpenMP Kokkos build, and MPI was configured with:

-D MPI_EXEC_POST_NUMPROCS_FLAGS="-bind-to;core;-map-by;core" \

I exported OMP_NUM_THREADS=2 for all tests. This is a Xeon 36-core node (72 with hyperthreads).

Without specifying the -j flag, the tests finished in 131 seconds, running one at a time. Running the same tests with -j16 took 1119 seconds. The timings for each test are listed below so you can compare.

[rppawlo@gge BUILD]$ hwloc-info 
depth 0: 1 Machine (type #1)
 depth 1:               2 NUMANode (type #2)
  depth 2:              2 Package (type #3)
   depth 3:             2 L3Cache (type #4)
    depth 4:            36 L2Cache (type #4)
     depth 5:           36 L1dCache (type #4)
      depth 6:          36 L1iCache (type #4)
       depth 7:         36 Core (type #5)
        depth 8:        72 PU (type #6)
Special depth -3: 5 Bridge (type #9)
Special depth -4: 7 PCI Device (type #10)
Special depth -5: 4 OS Device (type #11)

Running with ctest -j1:

`ctest -E ConvTest` results:
[rppawlo@gge panzer]$ ctest -E ConvTest 
Test project /ascldap/users/rppawlo/BUILD/packages/panzer
        Start   1: PanzerCore_version_MPI_1
  1/132 Test   #1: PanzerCore_version_MPI_1 .........................................   Passed    0.16 sec
        Start   2: PanzerCore_string_utilities_MPI_1
  2/132 Test   #2: PanzerCore_string_utilities_MPI_1 ................................   Passed    0.14 sec
        Start   3: PanzerCore_hash_utilities_MPI_1
  3/132 Test   #3: PanzerCore_hash_utilities_MPI_1 ..................................   Passed    0.14 sec
        Start   4: PanzerCore_memUtils_MPI_1
  4/132 Test   #4: PanzerCore_memUtils_MPI_1 ........................................   Passed    0.14 sec
        Start   5: PanzerDofMgr_tFieldPattern_MPI_4
  5/132 Test   #5: PanzerDofMgr_tFieldPattern_MPI_4 .................................   Passed    0.25 sec
        Start   6: PanzerDofMgr_tGeometricAggFieldPattern_MPI_4
  6/132 Test   #6: PanzerDofMgr_tGeometricAggFieldPattern_MPI_4 .....................   Passed    0.26 sec
        Start   7: PanzerDofMgr_tIntrepidFieldPattern_MPI_4
  7/132 Test   #7: PanzerDofMgr_tIntrepidFieldPattern_MPI_4 .........................   Passed    0.25 sec
        Start   8: PanzerDofMgr_tNodalFieldPattern_MPI_4
  8/132 Test   #8: PanzerDofMgr_tNodalFieldPattern_MPI_4 ............................   Passed    0.25 sec
        Start   9: PanzerDofMgr_tFieldAggPattern_MPI_4
  9/132 Test   #9: PanzerDofMgr_tFieldAggPattern_MPI_4 ..............................   Passed    0.25 sec
        Start  10: PanzerDofMgr_tUniqueGlobalIndexerUtilities_MPI_2
 10/132 Test  #10: PanzerDofMgr_tUniqueGlobalIndexerUtilities_MPI_2 .................   Passed    0.25 sec
        Start  11: PanzerDofMgr_tBlockedDOFManager_MPI_2
 11/132 Test  #11: PanzerDofMgr_tBlockedDOFManager_MPI_2 ............................   Passed    0.24 sec
        Start  12: PanzerDofMgr_tOrientations_MPI_1
 12/132 Test  #12: PanzerDofMgr_tOrientations_MPI_1 .................................   Passed    0.23 sec
        Start  13: PanzerDofMgr_tFilteredUGI_MPI_2
 13/132 Test  #13: PanzerDofMgr_tFilteredUGI_MPI_2 ..................................   Passed    0.24 sec
        Start  14: PanzerDofMgr_tCartesianDOFMgr_DynRankView_MPI_4
 14/132 Test  #14: PanzerDofMgr_tCartesianDOFMgr_DynRankView_MPI_4 ..................   Passed    0.27 sec
        Start  15: PanzerDofMgr_tCartesianDOFMgr_HighOrder_MPI_4
 15/132 Test  #15: PanzerDofMgr_tCartesianDOFMgr_HighOrder_MPI_4 ....................   Passed    0.26 sec
        Start  16: PanzerDofMgr_tCartesianDOFMgr_DG_MPI_4
 16/132 Test  #16: PanzerDofMgr_tCartesianDOFMgr_DG_MPI_4 ...........................   Passed    0.26 sec
        Start  17: PanzerDofMgr_tGeometricAggFieldPattern2_MPI_4
 17/132 Test  #17: PanzerDofMgr_tGeometricAggFieldPattern2_MPI_4 ....................   Passed    0.25 sec
        Start  18: PanzerDofMgr_tFieldPattern2_MPI_4
 18/132 Test  #18: PanzerDofMgr_tFieldPattern2_MPI_4 ................................   Passed    0.25 sec
        Start  19: PanzerDofMgr_tFieldAggPattern2_MPI_4
 19/132 Test  #19: PanzerDofMgr_tFieldAggPattern2_MPI_4 .............................   Passed    0.25 sec
        Start  20: PanzerDofMgr_tFieldAggPattern_DG_MPI_4
 20/132 Test  #20: PanzerDofMgr_tFieldAggPattern_DG_MPI_4 ...........................   Passed    0.26 sec
        Start  21: PanzerDofMgr_scaling_test
 21/132 Test  #21: PanzerDofMgr_scaling_test ........................................   Passed    1.23 sec
        Start  22: PanzerDiscFE_integration_rule_MPI_1
 22/132 Test  #22: PanzerDiscFE_integration_rule_MPI_1 ..............................   Passed    0.28 sec
        Start  23: PanzerDiscFE_integration_values2_MPI_1
 23/132 Test  #23: PanzerDiscFE_integration_values2_MPI_1 ...........................   Passed    0.28 sec
        Start  24: PanzerDiscFE_dimension_MPI_1
 24/132 Test  #24: PanzerDiscFE_dimension_MPI_1 .....................................   Passed    0.27 sec
        Start  25: PanzerDiscFE_basis_MPI_1
 25/132 Test  #25: PanzerDiscFE_basis_MPI_1 .........................................   Passed    0.27 sec
        Start  26: PanzerDiscFE_basis_values2_MPI_1
 26/132 Test  #26: PanzerDiscFE_basis_values2_MPI_1 .................................   Passed    0.30 sec
        Start  27: PanzerDiscFE_point_values2_MPI_1
 27/132 Test  #27: PanzerDiscFE_point_values2_MPI_1 .................................   Passed    0.28 sec
        Start  28: PanzerDiscFE_boundary_condition_MPI_1
 28/132 Test  #28: PanzerDiscFE_boundary_condition_MPI_1 ............................   Passed    0.28 sec
        Start  29: PanzerDiscFE_material_model_entry_MPI_1
 29/132 Test  #29: PanzerDiscFE_material_model_entry_MPI_1 ..........................   Passed    0.27 sec
        Start  30: PanzerDiscFE_stlmap_utilities_MPI_1
 30/132 Test  #30: PanzerDiscFE_stlmap_utilities_MPI_1 ..............................   Passed    0.27 sec
        Start  31: PanzerDiscFE_shards_utilities_MPI_1
 31/132 Test  #31: PanzerDiscFE_shards_utilities_MPI_1 ..............................   Passed    0.27 sec
        Start  32: PanzerDiscFE_evaluators_MPI_1
 32/132 Test  #32: PanzerDiscFE_evaluators_MPI_1 ....................................   Passed    0.28 sec
        Start  33: PanzerDiscFE_element_block_to_physics_block_map_MPI_1
 33/132 Test  #33: PanzerDiscFE_element_block_to_physics_block_map_MPI_1 ............   Passed    0.27 sec
        Start  34: PanzerDiscFE_zero_sensitivities_MPI_1
 34/132 Test  #34: PanzerDiscFE_zero_sensitivities_MPI_1 ............................   Passed    0.27 sec
        Start  35: PanzerDiscFE_output_stream_MPI_1
 35/132 Test  #35: PanzerDiscFE_output_stream_MPI_1 .................................   Passed    0.27 sec
        Start  36: PanzerDiscFE_global_data_MPI_1
 36/132 Test  #36: PanzerDiscFE_global_data_MPI_1 ...................................   Passed    0.27 sec
        Start  37: PanzerDiscFE_parameter_library_MPI_1
 37/132 Test  #37: PanzerDiscFE_parameter_library_MPI_1 .............................   Passed    0.28 sec
        Start  38: PanzerDiscFE_cell_topology_info_MPI_1
 38/132 Test  #38: PanzerDiscFE_cell_topology_info_MPI_1 ............................   Passed    0.27 sec
        Start  39: PanzerDiscFE_parameter_list_acceptance_test_MPI_1
 39/132 Test  #39: PanzerDiscFE_parameter_list_acceptance_test_MPI_1 ................   Passed    0.27 sec
        Start  40: PanzerDiscFE_view_factory_MPI_1
 40/132 Test  #40: PanzerDiscFE_view_factory_MPI_1 ..................................   Passed    0.27 sec
        Start  41: PanzerDiscFE_check_bc_consistency_MPI_1
 41/132 Test  #41: PanzerDiscFE_check_bc_consistency_MPI_1 ..........................   Passed    0.27 sec
        Start  42: PanzerDiscFE_equation_set_MPI_1
 42/132 Test  #42: PanzerDiscFE_equation_set_MPI_1 ..................................   Passed    0.27 sec
        Start  43: PanzerDiscFE_equation_set_composite_factory_MPI_1
 43/132 Test  #43: PanzerDiscFE_equation_set_composite_factory_MPI_1 ................   Passed    0.28 sec
        Start  44: PanzerDiscFE_closure_model_MPI_1
 44/132 Test  #44: PanzerDiscFE_closure_model_MPI_1 .................................   Passed    0.28 sec
        Start  45: PanzerDiscFE_closure_model_composite_MPI_1
 45/132 Test  #45: PanzerDiscFE_closure_model_composite_MPI_1 .......................   Passed    0.28 sec
        Start  46: PanzerDiscFE_physics_block_MPI_1
 46/132 Test  #46: PanzerDiscFE_physics_block_MPI_1 .................................   Passed    0.29 sec
        Start  47: PanzerDiscFE_tEpetraGather_MPI_4
 47/132 Test  #47: PanzerDiscFE_tEpetraGather_MPI_4 .................................   Passed    0.30 sec
        Start  48: PanzerDiscFE_tEpetraScatter_MPI_4
 48/132 Test  #48: PanzerDiscFE_tEpetraScatter_MPI_4 ................................   Passed    0.30 sec
        Start  49: PanzerDiscFE_tEpetraScatterDirichlet_MPI_4
 49/132 Test  #49: PanzerDiscFE_tEpetraScatterDirichlet_MPI_4 .......................   Passed    0.30 sec
        Start  50: PanzerDiscFE_LinearObjFactory_Tests_MPI_2
 50/132 Test  #50: PanzerDiscFE_LinearObjFactory_Tests_MPI_2 ........................   Passed    0.37 sec
        Start  51: PanzerDiscFE_tCloneLOF_MPI_2
 51/132 Test  #51: PanzerDiscFE_tCloneLOF_MPI_2 .....................................   Passed    0.30 sec
        Start  52: PanzerDiscFE_tEpetra_LOF_FilteredUGI_MPI_2
 52/132 Test  #52: PanzerDiscFE_tEpetra_LOF_FilteredUGI_MPI_2 .......................   Passed    0.30 sec
        Start  53: PanzerDiscFE_NormalsEvaluator_MPI_1
 53/132 Test  #53: PanzerDiscFE_NormalsEvaluator_MPI_1 ..............................   Passed    0.28 sec
        Start  54: PanzerDiscFE_IntegratorScalar_MPI_1
 54/132 Test  #54: PanzerDiscFE_IntegratorScalar_MPI_1 ..............................   Passed    0.28 sec
        Start  55: PanzerDiscFE_GatherCoordinates_MPI_1
 55/132 Test  #55: PanzerDiscFE_GatherCoordinates_MPI_1 .............................   Passed    0.28 sec
        Start  56: PanzerDiscFE_DOF_PointFields_MPI_1
 56/132 Test  #56: PanzerDiscFE_DOF_PointFields_MPI_1 ...............................   Passed    0.28 sec
        Start  57: PanzerDiscFE_DOF_BasisToBasis_MPI_1
 57/132 Test  #57: PanzerDiscFE_DOF_BasisToBasis_MPI_1 ..............................   Passed    0.28 sec
        Start  58: PanzerDiscFE_point_descriptor_MPI_1
 58/132 Test  #58: PanzerDiscFE_point_descriptor_MPI_1 ..............................   Passed    0.27 sec
        Start  59: PanzerAdaptersSTK_tSTKInterface_MPI_1
 59/132 Test  #59: PanzerAdaptersSTK_tSTKInterface_MPI_1 ............................   Passed    1.17 sec
        Start  60: PanzerAdaptersSTK_tLineMeshFactory_MPI_2
 60/132 Test  #60: PanzerAdaptersSTK_tLineMeshFactory_MPI_2 .........................   Passed    1.18 sec
        Start  61: PanzerAdaptersSTK_tSquareQuadMeshFactory_MPI_2
 61/132 Test  #61: PanzerAdaptersSTK_tSquareQuadMeshFactory_MPI_2 ...................   Passed    1.25 sec
        Start  62: PanzerAdaptersSTK_tSquareTriMeshFactory_MPI_2
 62/132 Test  #62: PanzerAdaptersSTK_tSquareTriMeshFactory_MPI_2 ....................   Passed    1.18 sec
        Start  63: PanzerAdaptersSTK_tCubeHexMeshFactory_MPI_2
 63/132 Test  #63: PanzerAdaptersSTK_tCubeHexMeshFactory_MPI_2 ......................   Passed    2.10 sec
        Start  64: PanzerAdaptersSTK_tCubeTetMeshFactory_MPI_2
 64/132 Test  #64: PanzerAdaptersSTK_tCubeTetMeshFactory_MPI_2 ......................   Passed    1.33 sec
        Start  65: PanzerAdaptersSTK_tSingleBlockCubeHexMeshFactory_MPI_4
 65/132 Test  #65: PanzerAdaptersSTK_tSingleBlockCubeHexMeshFactory_MPI_4 ...........   Passed    1.19 sec
        Start  66: PanzerAdaptersSTK_tSTK_IO_MPI_1
 66/132 Test  #66: PanzerAdaptersSTK_tSTK_IO_MPI_1 ..................................   Passed    1.22 sec
        Start  67: PanzerAdaptersSTK_tExodusReaderFactory_MPI_2
 67/132 Test  #67: PanzerAdaptersSTK_tExodusReaderFactory_MPI_2 .....................   Passed    1.22 sec
        Start  68: PanzerAdaptersSTK_tGhosting_MPI_4
 68/132 Test  #68: PanzerAdaptersSTK_tGhosting_MPI_4 ................................   Passed    1.19 sec
        Start  69: PanzerAdaptersSTK_tSTKConnManager_MPI_2
 69/132 Test  #69: PanzerAdaptersSTK_tSTKConnManager_MPI_2 ..........................   Passed    1.19 sec
        Start  70: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_MPI_2
 70/132 Test  #70: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_MPI_2 ................   Passed    1.22 sec
        Start  71: PanzerAdaptersSTK_tDOFManager2_Orientation_MPI_2
 71/132 Test  #71: PanzerAdaptersSTK_tDOFManager2_Orientation_MPI_2 .................   Passed    1.18 sec
        Start  72: PanzerAdaptersSTK_tSquareTriMeshDOFManager_MPI_2
 72/132 Test  #72: PanzerAdaptersSTK_tSquareTriMeshDOFManager_MPI_2 .................   Passed    1.19 sec
        Start  73: PanzerAdaptersSTK_tEpetraLinObjFactory_MPI_2
 73/132 Test  #73: PanzerAdaptersSTK_tEpetraLinObjFactory_MPI_2 .....................   Passed    1.18 sec
        Start  74: PanzerAdaptersSTK_tCubeHexMeshDOFManager_MPI_2
 74/132 Test  #74: PanzerAdaptersSTK_tCubeHexMeshDOFManager_MPI_2 ...................   Passed    1.22 sec
        Start  75: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_edgetests_MPI_1
 75/132 Test  #75: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_edgetests_MPI_1 ......   Passed    1.16 sec
        Start  76: PanzerAdaptersSTK_tBlockedDOFManagerFactory_MPI_2
 76/132 Test  #76: PanzerAdaptersSTK_tBlockedDOFManagerFactory_MPI_2 ................   Passed    1.15 sec
        Start  77: PanzerAdaptersSTK_tDOFManager2_SimpleTests_MPI_4
 77/132 Test  #77: PanzerAdaptersSTK_tDOFManager2_SimpleTests_MPI_4 .................   Passed    1.47 sec
        Start  78: PanzerAdaptersSTK_workset_builder_MPI_1
 78/132 Test  #78: PanzerAdaptersSTK_workset_builder_MPI_1 ..........................   Passed    1.21 sec
        Start  79: PanzerAdaptersSTK_d_workset_builder_MPI_2
 79/132 Test  #79: PanzerAdaptersSTK_d_workset_builder_MPI_2 ........................   Passed    1.20 sec
        Start  80: PanzerAdaptersSTK_d_workset_builder_3d_MPI_1
 80/132 Test  #80: PanzerAdaptersSTK_d_workset_builder_3d_MPI_1 .....................   Passed    1.17 sec
        Start  81: PanzerAdaptersSTK_cascade_MPI_2
 81/132 Test  #81: PanzerAdaptersSTK_cascade_MPI_2 ..................................   Passed    1.17 sec
        Start  82: PanzerAdaptersSTK_hdiv_basis_MPI_1
 82/132 Test  #82: PanzerAdaptersSTK_hdiv_basis_MPI_1 ...............................   Passed    3.45 sec
        Start  83: PanzerAdaptersSTK_workset_container_MPI_2
 83/132 Test  #83: PanzerAdaptersSTK_workset_container_MPI_2 ........................   Passed    1.27 sec
        Start  84: PanzerAdaptersSTK_field_manager_builder_MPI_1
 84/132 Test  #84: PanzerAdaptersSTK_field_manager_builder_MPI_1 ....................   Passed    1.18 sec
        Start  85: PanzerAdaptersSTK_initial_condition_builder_MPI_1
 85/132 Test  #85: PanzerAdaptersSTK_initial_condition_builder_MPI_1 ................   Passed    1.18 sec
        Start  86: PanzerAdaptersSTK_initial_condition_builder2_MPI_2
 86/132 Test  #86: PanzerAdaptersSTK_initial_condition_builder2_MPI_2 ...............   Passed    1.22 sec
        Start  87: PanzerAdaptersSTK_initial_condition_control_MPI_2
 87/132 Test  #87: PanzerAdaptersSTK_initial_condition_control_MPI_2 ................   Passed    1.21 sec
        Start  88: PanzerAdaptersSTK_assembly_engine_MPI_4
 88/132 Test  #88: PanzerAdaptersSTK_assembly_engine_MPI_4 ..........................   Passed    1.90 sec
        Start  89: PanzerAdaptersSTK_simple_bc_MPI_2
 89/132 Test  #89: PanzerAdaptersSTK_simple_bc_MPI_2 ................................   Passed    1.24 sec
        Start  90: PanzerAdaptersSTK_model_evaluator_MPI_4
 90/132 Test  #90: PanzerAdaptersSTK_model_evaluator_MPI_4 ..........................   Passed    1.39 sec
        Start  91: PanzerAdaptersSTK_model_evaluator_mass_check_MPI_1
 91/132 Test  #91: PanzerAdaptersSTK_model_evaluator_mass_check_MPI_1 ...............   Passed    1.17 sec
        Start  92: PanzerAdaptersSTK_thyra_model_evaluator_MPI_4
 92/132 Test  #92: PanzerAdaptersSTK_thyra_model_evaluator_MPI_4 ....................   Passed    1.96 sec
        Start  93: PanzerAdaptersSTK_explicit_model_evaluator_MPI_4
 93/132 Test  #93: PanzerAdaptersSTK_explicit_model_evaluator_MPI_4 .................   Passed    1.25 sec
        Start  94: PanzerAdaptersSTK_response_residual_MPI_2
 94/132 Test  #94: PanzerAdaptersSTK_response_residual_MPI_2 ........................   Passed    1.58 sec
        Start  95: PanzerAdaptersSTK_solver_MPI_4
 95/132 Test  #95: PanzerAdaptersSTK_solver_MPI_4 ...................................   Passed    1.29 sec
        Start  96: PanzerAdaptersSTK_gs_evaluators_MPI_1
 96/132 Test  #96: PanzerAdaptersSTK_gs_evaluators_MPI_1 ............................   Passed    1.20 sec
        Start  97: PanzerAdaptersSTK_scatter_field_evaluator_MPI_1
 97/132 Test  #97: PanzerAdaptersSTK_scatter_field_evaluator_MPI_1 ..................   Passed    1.20 sec
        Start  98: PanzerAdaptersSTK_periodic_bcs_MPI_4
 98/132 Test  #98: PanzerAdaptersSTK_periodic_bcs_MPI_4 .............................   Passed    1.28 sec
        Start  99: PanzerAdaptersSTK_periodic_mesh_MPI_2
 99/132 Test  #99: PanzerAdaptersSTK_periodic_mesh_MPI_2 ............................   Passed    1.35 sec
        Start 100: PanzerAdaptersSTK_bcstrategy_MPI_1
100/132 Test #100: PanzerAdaptersSTK_bcstrategy_MPI_1 ...............................   Passed    1.17 sec
        Start 101: PanzerAdaptersSTK_bcstrategy_composite_factory_MPI_1
101/132 Test #101: PanzerAdaptersSTK_bcstrategy_composite_factory_MPI_1 .............   Passed    1.13 sec
        Start 102: PanzerAdaptersSTK_STK_ResponseLibraryTest2_MPI_2
102/132 Test #102: PanzerAdaptersSTK_STK_ResponseLibraryTest2_MPI_2 .................   Passed    1.28 sec
        Start 103: PanzerAdaptersSTK_STK_VolumeSideResponse_MPI_2
103/132 Test #103: PanzerAdaptersSTK_STK_VolumeSideResponse_MPI_2 ...................   Passed    1.23 sec
        Start 104: PanzerAdaptersSTK_ip_coordinates_MPI_2
104/132 Test #104: PanzerAdaptersSTK_ip_coordinates_MPI_2 ...........................   Passed    1.20 sec
        Start 105: PanzerAdaptersSTK_tGatherSolution_MPI_2
105/132 Test #105: PanzerAdaptersSTK_tGatherSolution_MPI_2 ..........................   Passed    1.19 sec
        Start 106: PanzerAdaptersSTK_tScatterResidual_MPI_2
106/132 Test #106: PanzerAdaptersSTK_tScatterResidual_MPI_2 .........................   Passed    1.18 sec
        Start 107: PanzerAdaptersSTK_tScatterDirichletResidual_MPI_2
107/132 Test #107: PanzerAdaptersSTK_tScatterDirichletResidual_MPI_2 ................   Passed    1.19 sec
        Start 108: PanzerAdaptersSTK_tBasisTimesVector_MPI_1
108/132 Test #108: PanzerAdaptersSTK_tBasisTimesVector_MPI_1 ........................   Passed    1.16 sec
        Start 109: PanzerAdaptersSTK_tPointBasisValuesEvaluator_MPI_1
109/132 Test #109: PanzerAdaptersSTK_tPointBasisValuesEvaluator_MPI_1 ...............   Passed    1.17 sec
        Start 110: PanzerAdaptersSTK_node_normals_MPI_2
110/132 Test #110: PanzerAdaptersSTK_node_normals_MPI_2 .............................   Passed    1.24 sec
        Start 111: PanzerAdaptersSTK_tFaceToElem_MPI_2
111/132 Test #111: PanzerAdaptersSTK_tFaceToElem_MPI_2 ..............................   Passed    1.31 sec
        Start 112: PanzerAdaptersSTK_square_mesh_MPI_4
112/132 Test #112: PanzerAdaptersSTK_square_mesh_MPI_4 ..............................   Passed    1.20 sec
        Start 113: PanzerAdaptersSTK_square_mesh_bc_MPI_4
113/132 Test #113: PanzerAdaptersSTK_square_mesh_bc_MPI_4 ...........................   Passed    1.18 sec
        Start 114: PanzerAdaptersSTK_CurlLaplacianExample
114/132 Test #114: PanzerAdaptersSTK_CurlLaplacianExample ...........................   Passed    2.61 sec
        Start 115: PanzerAdaptersSTK_MixedPoissonExample
115/132 Test #115: PanzerAdaptersSTK_MixedPoissonExample ............................   Passed    2.86 sec
        Start 116: PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4
116/132 Test #116: PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4 ...............   Passed    3.40 sec
        Start 117: PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1
117/132 Test #117: PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 ...   Passed    5.70 sec
        Start 118: PanzerAdaptersSTK_PoissonInterfaceExample_3d_MPI_4
118/132 Test #118: PanzerAdaptersSTK_PoissonInterfaceExample_3d_MPI_4 ...............   Passed    5.68 sec
        Start 119: PanzerAdaptersSTK_assembly_example_MPI_4
119/132 Test #119: PanzerAdaptersSTK_assembly_example_MPI_4 .........................   Passed    1.29 sec
        Start 120: PanzerAdaptersSTK_main_driver_energy-ss
120/132 Test #120: PanzerAdaptersSTK_main_driver_energy-ss ..........................   Passed    2.89 sec
        Start 121: PanzerAdaptersSTK_main_driver_energy-transient
121/132 Test #121: PanzerAdaptersSTK_main_driver_energy-transient ...................   Passed    1.63 sec
        Start 122: PanzerAdaptersSTK_main_driver_energy-ss-blocked
122/132 Test #122: PanzerAdaptersSTK_main_driver_energy-ss-blocked ..................   Passed    1.55 sec
        Start 123: PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue
123/132 Test #123: PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue ..........   Passed    1.70 sec
        Start 124: PanzerAdaptersSTK_main_driver_energy-ss-blocked-tp
124/132 Test #124: PanzerAdaptersSTK_main_driver_energy-ss-blocked-tp ...............   Passed    2.82 sec
        Start 125: PanzerAdaptersSTK_main_driver_energy-neumann
125/132 Test #125: PanzerAdaptersSTK_main_driver_energy-neumann .....................   Passed    1.27 sec
        Start 126: PanzerAdaptersSTK_main_driver_meshmotion
126/132 Test #126: PanzerAdaptersSTK_main_driver_meshmotion .........................   Passed    2.05 sec
        Start 127: PanzerAdaptersSTK_main_driver_energy-transient-blocked
127/132 Test #127: PanzerAdaptersSTK_main_driver_energy-transient-blocked ...........   Passed    1.89 sec
        Start 128: PanzerAdaptersSTK_me_main_driver_energy-ss
128/132 Test #128: PanzerAdaptersSTK_me_main_driver_energy-ss .......................   Passed    1.27 sec
        Start 129: PanzerAdaptersSTK_siamCse17_MPI_4
129/132 Test #129: PanzerAdaptersSTK_siamCse17_MPI_4 ................................   Passed    1.26 sec
        Start 130: PanzerAdaptersIOSS_tIOSSConnManager1_MPI_1
130/132 Test #130: PanzerAdaptersIOSS_tIOSSConnManager1_MPI_1 .......................   Passed    1.09 sec
        Start 131: PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2
131/132 Test #131: PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 .......................   Passed    1.08 sec
        Start 132: PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3
132/132 Test #132: PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3 .......................   Passed    1.07 sec

100% tests passed, 0 tests failed out of 132

Label Time Summary:
Panzer    = 131.27 sec (132 tests)

Total Test time (real) = 132.51 sec

Running with ctest -j16:

`ctest -E ConvTest -j16` results:
[rppawlo@gge panzer]$ ctest -E ConvTest -j16
Test project /ascldap/users/rppawlo/BUILD/packages/panzer
        Start 117: PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1
        Start 118: PanzerAdaptersSTK_PoissonInterfaceExample_3d_MPI_4
        Start  82: PanzerAdaptersSTK_hdiv_basis_MPI_1
        Start 116: PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4
        Start 120: PanzerAdaptersSTK_main_driver_energy-ss
        Start  63: PanzerAdaptersSTK_tCubeHexMeshFactory_MPI_2
  1/132 Test  #63: PanzerAdaptersSTK_tCubeHexMeshFactory_MPI_2 ......................   Passed   11.92 sec
        Start 126: PanzerAdaptersSTK_main_driver_meshmotion
  2/132 Test  #82: PanzerAdaptersSTK_hdiv_basis_MPI_1 ...............................   Passed   24.25 sec
        Start 125: PanzerAdaptersSTK_main_driver_energy-neumann
  3/132 Test #125: PanzerAdaptersSTK_main_driver_energy-neumann .....................   Passed   29.45 sec
        Start  66: PanzerAdaptersSTK_tSTK_IO_MPI_1
  4/132 Test  #66: PanzerAdaptersSTK_tSTK_IO_MPI_1 ..................................   Passed    6.63 sec
        Start  78: PanzerAdaptersSTK_workset_builder_MPI_1
  5/132 Test  #78: PanzerAdaptersSTK_workset_builder_MPI_1 ..........................   Passed   32.46 sec
        Start  97: PanzerAdaptersSTK_scatter_field_evaluator_MPI_1
  6/132 Test  #97: PanzerAdaptersSTK_scatter_field_evaluator_MPI_1 ..................   Passed    5.41 sec
        Start  96: PanzerAdaptersSTK_gs_evaluators_MPI_1
  7/132 Test  #96: PanzerAdaptersSTK_gs_evaluators_MPI_1 ............................   Passed    6.01 sec
        Start  85: PanzerAdaptersSTK_initial_condition_builder_MPI_1
  8/132 Test  #85: PanzerAdaptersSTK_initial_condition_builder_MPI_1 ................   Passed   23.44 sec
        Start  84: PanzerAdaptersSTK_field_manager_builder_MPI_1
  9/132 Test  #84: PanzerAdaptersSTK_field_manager_builder_MPI_1 ....................   Passed   21.65 sec
        Start  91: PanzerAdaptersSTK_model_evaluator_mass_check_MPI_1
 10/132 Test  #91: PanzerAdaptersSTK_model_evaluator_mass_check_MPI_1 ...............   Passed   10.72 sec
        Start  59: PanzerAdaptersSTK_tSTKInterface_MPI_1
 11/132 Test  #59: PanzerAdaptersSTK_tSTKInterface_MPI_1 ............................   Passed    6.51 sec
        Start 100: PanzerAdaptersSTK_bcstrategy_MPI_1
 12/132 Test #100: PanzerAdaptersSTK_bcstrategy_MPI_1 ...............................   Passed   18.03 sec
        Start 109: PanzerAdaptersSTK_tPointBasisValuesEvaluator_MPI_1
 13/132 Test #109: PanzerAdaptersSTK_tPointBasisValuesEvaluator_MPI_1 ...............   Passed   20.94 sec
        Start  80: PanzerAdaptersSTK_d_workset_builder_3d_MPI_1
 14/132 Test  #80: PanzerAdaptersSTK_d_workset_builder_3d_MPI_1 .....................   Passed    6.51 sec
        Start 108: PanzerAdaptersSTK_tBasisTimesVector_MPI_1
 15/132 Test #126: PanzerAdaptersSTK_main_driver_meshmotion .........................   Passed  208.12 sec
        Start  94: PanzerAdaptersSTK_response_residual_MPI_2
 16/132 Test #108: PanzerAdaptersSTK_tBasisTimesVector_MPI_1 ........................   Passed    7.62 sec
        Start  75: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_edgetests_MPI_1
 17/132 Test  #75: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_edgetests_MPI_1 ......   Passed    8.32 sec
        Start 101: PanzerAdaptersSTK_bcstrategy_composite_factory_MPI_1
 18/132 Test #101: PanzerAdaptersSTK_bcstrategy_composite_factory_MPI_1 .............   Passed    6.51 sec
        Start 130: PanzerAdaptersIOSS_tIOSSConnManager1_MPI_1
 19/132 Test #130: PanzerAdaptersIOSS_tIOSSConnManager1_MPI_1 .......................   Passed    8.22 sec
        Start  26: PanzerDiscFE_basis_values2_MPI_1
 20/132 Test  #26: PanzerDiscFE_basis_values2_MPI_1 .................................   Passed    1.60 sec
        Start  46: PanzerDiscFE_physics_block_MPI_1
 21/132 Test  #46: PanzerDiscFE_physics_block_MPI_1 .................................   Passed    1.60 sec
        Start  22: PanzerDiscFE_integration_rule_MPI_1
 22/132 Test  #22: PanzerDiscFE_integration_rule_MPI_1 ..............................   Passed    1.60 sec
        Start  56: PanzerDiscFE_DOF_PointFields_MPI_1
 23/132 Test  #56: PanzerDiscFE_DOF_PointFields_MPI_1 ...............................   Passed    8.22 sec
        Start  55: PanzerDiscFE_GatherCoordinates_MPI_1
 24/132 Test  #55: PanzerDiscFE_GatherCoordinates_MPI_1 .............................   Passed    1.60 sec
        Start  23: PanzerDiscFE_integration_values2_MPI_1
 25/132 Test  #23: PanzerDiscFE_integration_values2_MPI_1 ...........................   Passed    4.01 sec
        Start  54: PanzerDiscFE_IntegratorScalar_MPI_1
 26/132 Test  #54: PanzerDiscFE_IntegratorScalar_MPI_1 ..............................   Passed    1.60 sec
        Start  53: PanzerDiscFE_NormalsEvaluator_MPI_1
 27/132 Test  #94: PanzerAdaptersSTK_response_residual_MPI_2 ........................   Passed   47.88 sec
        Start  99: PanzerAdaptersSTK_periodic_mesh_MPI_2
 28/132 Test  #53: PanzerDiscFE_NormalsEvaluator_MPI_1 ..............................   Passed    0.98 sec
        Start  57: PanzerDiscFE_DOF_BasisToBasis_MPI_1
 29/132 Test  #57: PanzerDiscFE_DOF_BasisToBasis_MPI_1 ..............................   Passed    2.20 sec
        Start  32: PanzerDiscFE_evaluators_MPI_1
 30/132 Test  #32: PanzerDiscFE_evaluators_MPI_1 ....................................   Passed    1.60 sec
        Start  27: PanzerDiscFE_point_values2_MPI_1
 31/132 Test  #27: PanzerDiscFE_point_values2_MPI_1 .................................   Passed    1.60 sec
        Start  44: PanzerDiscFE_closure_model_MPI_1
 32/132 Test #120: PanzerAdaptersSTK_main_driver_energy-ss ..........................   Passed  275.85 sec
        Start 115: PanzerAdaptersSTK_MixedPoissonExample
 33/132 Test  #44: PanzerDiscFE_closure_model_MPI_1 .................................   Passed    1.51 sec
        Start  28: PanzerDiscFE_boundary_condition_MPI_1
 34/132 Test  #28: PanzerDiscFE_boundary_condition_MPI_1 ............................   Passed    1.06 sec
        Start  43: PanzerDiscFE_equation_set_composite_factory_MPI_1
 35/132 Test  #43: PanzerDiscFE_equation_set_composite_factory_MPI_1 ................   Passed    1.60 sec
        Start  37: PanzerDiscFE_parameter_library_MPI_1
 36/132 Test  #99: PanzerAdaptersSTK_periodic_mesh_MPI_2 ............................   Passed   10.90 sec
        Start  64: PanzerAdaptersSTK_tCubeTetMeshFactory_MPI_2
 37/132 Test  #37: PanzerDiscFE_parameter_library_MPI_1 .............................   Passed    0.90 sec
        Start  45: PanzerDiscFE_closure_model_composite_MPI_1
 38/132 Test  #45: PanzerDiscFE_closure_model_composite_MPI_1 .......................   Passed    1.09 sec
        Start  40: PanzerDiscFE_view_factory_MPI_1
 39/132 Test  #40: PanzerDiscFE_view_factory_MPI_1 ..................................   Passed    1.60 sec
        Start  30: PanzerDiscFE_stlmap_utilities_MPI_1
 40/132 Test  #30: PanzerDiscFE_stlmap_utilities_MPI_1 ..............................   Passed    1.60 sec
        Start  39: PanzerDiscFE_parameter_list_acceptance_test_MPI_1
 41/132 Test  #64: PanzerAdaptersSTK_tCubeTetMeshFactory_MPI_2 ......................   Passed    6.51 sec
 42/132 Test  #39: PanzerDiscFE_parameter_list_acceptance_test_MPI_1 ................   Passed    1.50 sec
        Start 111: PanzerAdaptersSTK_tFaceToElem_MPI_2
        Start  42: PanzerDiscFE_equation_set_MPI_1
 43/132 Test  #42: PanzerDiscFE_equation_set_MPI_1 ..................................   Passed    1.80 sec
 44/132 Test #117: PanzerAdaptersSTK_PoissonInterfaceExample_2d_diffsideids_MPI_1 ...   Passed  288.36 sec
        Start 102: PanzerAdaptersSTK_STK_ResponseLibraryTest2_MPI_2
 45/132 Test #111: PanzerAdaptersSTK_tFaceToElem_MPI_2 ..............................   Passed    9.02 sec
        Start  83: PanzerAdaptersSTK_workset_container_MPI_2
 46/132 Test  #83: PanzerAdaptersSTK_workset_container_MPI_2 ........................   Passed    5.43 sec
        Start  61: PanzerAdaptersSTK_tSquareQuadMeshFactory_MPI_2
 47/132 Test #102: PanzerAdaptersSTK_STK_ResponseLibraryTest2_MPI_2 .................   Passed   12.45 sec
        Start  89: PanzerAdaptersSTK_simple_bc_MPI_2
 48/132 Test  #61: PanzerAdaptersSTK_tSquareQuadMeshFactory_MPI_2 ...................   Passed    5.31 sec
        Start 110: PanzerAdaptersSTK_node_normals_MPI_2
 49/132 Test  #89: PanzerAdaptersSTK_simple_bc_MPI_2 ................................   Passed   11.02 sec
        Start 103: PanzerAdaptersSTK_STK_VolumeSideResponse_MPI_2
 50/132 Test #115: PanzerAdaptersSTK_MixedPoissonExample ............................   Passed   37.17 sec
        Start 124: PanzerAdaptersSTK_main_driver_energy-ss-blocked-tp
 51/132 Test #110: PanzerAdaptersSTK_node_normals_MPI_2 .............................   Passed    8.22 sec
        Start  74: PanzerAdaptersSTK_tCubeHexMeshDOFManager_MPI_2
 52/132 Test #103: PanzerAdaptersSTK_STK_VolumeSideResponse_MPI_2 ...................   Passed    5.92 sec
        Start  86: PanzerAdaptersSTK_initial_condition_builder2_MPI_2
 53/132 Test  #74: PanzerAdaptersSTK_tCubeHexMeshDOFManager_MPI_2 ...................   Passed    6.11 sec
        Start  67: PanzerAdaptersSTK_tExodusReaderFactory_MPI_2
 54/132 Test  #86: PanzerAdaptersSTK_initial_condition_builder2_MPI_2 ...............   Passed    6.01 sec
        Start  70: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_MPI_2
 55/132 Test  #67: PanzerAdaptersSTK_tExodusReaderFactory_MPI_2 .....................   Passed    4.66 sec
        Start  87: PanzerAdaptersSTK_initial_condition_control_MPI_2
 56/132 Test  #87: PanzerAdaptersSTK_initial_condition_control_MPI_2 ................   Passed    5.21 sec
        Start 104: PanzerAdaptersSTK_ip_coordinates_MPI_2
 57/132 Test #104: PanzerAdaptersSTK_ip_coordinates_MPI_2 ...........................   Passed    6.21 sec
        Start  79: PanzerAdaptersSTK_d_workset_builder_MPI_2
 58/132 Test  #70: PanzerAdaptersSTK_tSquareQuadMeshDOFManager_MPI_2 ................   Passed   13.88 sec
        Start 105: PanzerAdaptersSTK_tGatherSolution_MPI_2
 59/132 Test  #79: PanzerAdaptersSTK_d_workset_builder_MPI_2 ........................   Passed    6.51 sec
        Start 107: PanzerAdaptersSTK_tScatterDirichletResidual_MPI_2
 60/132 Test #105: PanzerAdaptersSTK_tGatherSolution_MPI_2 ..........................   Passed    5.63 sec
        Start  69: PanzerAdaptersSTK_tSTKConnManager_MPI_2
 61/132 Test  #69: PanzerAdaptersSTK_tSTKConnManager_MPI_2 ..........................   Passed    6.21 sec
        Start  72: PanzerAdaptersSTK_tSquareTriMeshDOFManager_MPI_2
 62/132 Test #107: PanzerAdaptersSTK_tScatterDirichletResidual_MPI_2 ................   Passed   10.44 sec
        Start  62: PanzerAdaptersSTK_tSquareTriMeshFactory_MPI_2
 63/132 Test  #72: PanzerAdaptersSTK_tSquareTriMeshDOFManager_MPI_2 .................   Passed    5.55 sec
        Start 106: PanzerAdaptersSTK_tScatterResidual_MPI_2
 64/132 Test  #62: PanzerAdaptersSTK_tSquareTriMeshFactory_MPI_2 ....................   Passed    4.35 sec
        Start  71: PanzerAdaptersSTK_tDOFManager2_Orientation_MPI_2
 65/132 Test #106: PanzerAdaptersSTK_tScatterResidual_MPI_2 .........................   Passed    7.01 sec
        Start  60: PanzerAdaptersSTK_tLineMeshFactory_MPI_2
 66/132 Test  #71: PanzerAdaptersSTK_tDOFManager2_Orientation_MPI_2 .................   Passed    8.32 sec
        Start  73: PanzerAdaptersSTK_tEpetraLinObjFactory_MPI_2
 67/132 Test  #60: PanzerAdaptersSTK_tLineMeshFactory_MPI_2 .........................   Passed    4.31 sec
        Start  81: PanzerAdaptersSTK_cascade_MPI_2
 68/132 Test  #73: PanzerAdaptersSTK_tEpetraLinObjFactory_MPI_2 .....................   Passed    5.51 sec
 69/132 Test  #81: PanzerAdaptersSTK_cascade_MPI_2 ..................................   Passed    4.31 sec
        Start 114: PanzerAdaptersSTK_CurlLaplacianExample
 70/132 Test #114: PanzerAdaptersSTK_CurlLaplacianExample ...........................   Passed   81.47 sec
        Start  92: PanzerAdaptersSTK_thyra_model_evaluator_MPI_4
 71/132 Test  #92: PanzerAdaptersSTK_thyra_model_evaluator_MPI_4 ....................   Passed  124.12 sec
        Start  88: PanzerAdaptersSTK_assembly_engine_MPI_4
 72/132 Test #118: PanzerAdaptersSTK_PoissonInterfaceExample_3d_MPI_4 ...............   Passed  665.81 sec
        Start 127: PanzerAdaptersSTK_main_driver_energy-transient-blocked
 73/132 Test #116: PanzerAdaptersSTK_PoissonInterfaceExample_2d_MPI_4 ...............   Passed  705.88 sec
        Start 123: PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue
 74/132 Test #124: PanzerAdaptersSTK_main_driver_energy-ss-blocked-tp ...............   Passed  561.85 sec
        Start 121: PanzerAdaptersSTK_main_driver_energy-transient
 75/132 Test #127: PanzerAdaptersSTK_main_driver_energy-transient-blocked ...........   Passed  298.22 sec
        Start 122: PanzerAdaptersSTK_main_driver_energy-ss-blocked
 76/132 Test #121: PanzerAdaptersSTK_main_driver_energy-transient ...................   Passed  146.27 sec
        Start  77: PanzerAdaptersSTK_tDOFManager2_SimpleTests_MPI_4
 77/132 Test  #77: PanzerAdaptersSTK_tDOFManager2_SimpleTests_MPI_4 .................   Passed   24.14 sec
        Start  90: PanzerAdaptersSTK_model_evaluator_MPI_4
 78/132 Test #122: PanzerAdaptersSTK_main_driver_energy-ss-blocked ..................   Passed   85.07 sec
        Start 119: PanzerAdaptersSTK_assembly_example_MPI_4
 79/132 Test  #88: PanzerAdaptersSTK_assembly_engine_MPI_4 ..........................   Passed  478.97 sec
        Start  95: PanzerAdaptersSTK_solver_MPI_4
 80/132 Test #119: PanzerAdaptersSTK_assembly_example_MPI_4 .........................   Passed   12.67 sec
        Start  98: PanzerAdaptersSTK_periodic_bcs_MPI_4
 81/132 Test #123: PanzerAdaptersSTK_main_driver_energy-ss-loca-eigenvalue ..........   Passed  363.45 sec
        Start 128: PanzerAdaptersSTK_me_main_driver_energy-ss
 82/132 Test  #90: PanzerAdaptersSTK_model_evaluator_MPI_4 ..........................   Passed   25.13 sec
        Start 129: PanzerAdaptersSTK_siamCse17_MPI_4
 83/132 Test #128: PanzerAdaptersSTK_me_main_driver_energy-ss .......................   Passed    8.99 sec
        Start  93: PanzerAdaptersSTK_explicit_model_evaluator_MPI_4
 84/132 Test  #98: PanzerAdaptersSTK_periodic_bcs_MPI_4 .............................   Passed   18.26 sec
        Start  21: PanzerDofMgr_scaling_test
 85/132 Test #129: PanzerAdaptersSTK_siamCse17_MPI_4 ................................   Passed    9.42 sec
        Start 112: PanzerAdaptersSTK_square_mesh_MPI_4
 86/132 Test  #95: PanzerAdaptersSTK_solver_MPI_4 ...................................   Passed   26.74 sec
        Start  68: PanzerAdaptersSTK_tGhosting_MPI_4
 87/132 Test #112: PanzerAdaptersSTK_square_mesh_MPI_4 ..............................   Passed    4.01 sec
        Start  65: PanzerAdaptersSTK_tSingleBlockCubeHexMeshFactory_MPI_4
 88/132 Test  #68: PanzerAdaptersSTK_tGhosting_MPI_4 ................................   Passed    4.09 sec
        Start 113: PanzerAdaptersSTK_square_mesh_bc_MPI_4
 89/132 Test  #93: PanzerAdaptersSTK_explicit_model_evaluator_MPI_4 .................   Passed   11.03 sec
 90/132 Test  #65: PanzerAdaptersSTK_tSingleBlockCubeHexMeshFactory_MPI_4 ...........   Passed    3.38 sec
        Start  76: PanzerAdaptersSTK_tBlockedDOFManagerFactory_MPI_2
        Start 131: PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2
        Start 132: PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3
        Start  31: PanzerDiscFE_shards_utilities_MPI_1
 91/132 Test  #31: PanzerDiscFE_shards_utilities_MPI_1 ..............................   Passed    1.10 sec
        Start  58: PanzerDiscFE_point_descriptor_MPI_1
 92/132 Test  #58: PanzerDiscFE_point_descriptor_MPI_1 ..............................   Passed    1.20 sec
        Start  36: PanzerDiscFE_global_data_MPI_1
 93/132 Test #113: PanzerAdaptersSTK_square_mesh_bc_MPI_4 ...........................   Passed    4.30 sec
 94/132 Test  #36: PanzerDiscFE_global_data_MPI_1 ...................................   Passed    1.30 sec
        Start  50: PanzerDiscFE_LinearObjFactory_Tests_MPI_2
        Start  51: PanzerDiscFE_tCloneLOF_MPI_2
        Start  25: PanzerDiscFE_basis_MPI_1
 95/132 Test  #76: PanzerAdaptersSTK_tBlockedDOFManagerFactory_MPI_2 ................   Passed    4.62 sec
        Start  52: PanzerDiscFE_tEpetra_LOF_FilteredUGI_MPI_2
 96/132 Test #131: PanzerAdaptersIOSS_tIOSSConnManager2_MPI_2 .......................   Passed    5.33 sec
 97/132 Test #132: PanzerAdaptersIOSS_tIOSSConnManager3_MPI_3 .......................   Passed    5.32 sec
        Start  48: PanzerDiscFE_tEpetraScatter_MPI_4
        Start  29: PanzerDiscFE_material_model_entry_MPI_1
 98/132 Test  #25: PanzerDiscFE_basis_MPI_1 .........................................   Passed    1.11 sec
        Start  41: PanzerDiscFE_check_bc_consistency_MPI_1
 99/132 Test  #29: PanzerDiscFE_material_model_entry_MPI_1 ..........................   Passed    1.21 sec
        Start  38: PanzerDiscFE_cell_topology_info_MPI_1
100/132 Test  #48: PanzerDiscFE_tEpetraScatter_MPI_4 ................................   Passed    1.51 sec
        Start  47: PanzerDiscFE_tEpetraGather_MPI_4
101/132 Test  #41: PanzerDiscFE_check_bc_consistency_MPI_1 ..........................   Passed    1.31 sec
102/132 Test  #38: PanzerDiscFE_cell_topology_info_MPI_1 ............................   Passed    1.11 sec
        Start  35: PanzerDiscFE_output_stream_MPI_1
        Start  24: PanzerDiscFE_dimension_MPI_1
103/132 Test  #52: PanzerDiscFE_tEpetra_LOF_FilteredUGI_MPI_2 .......................   Passed    3.62 sec
104/132 Test  #47: PanzerDiscFE_tEpetraGather_MPI_4 .................................   Passed    1.32 sec
105/132 Test  #24: PanzerDiscFE_dimension_MPI_1 .....................................   Passed    1.12 sec
        Start  49: PanzerDiscFE_tEpetraScatterDirichlet_MPI_4
        Start  33: PanzerDiscFE_element_block_to_physics_block_map_MPI_1
        Start  34: PanzerDiscFE_zero_sensitivities_MPI_1
        Start  12: PanzerDofMgr_tOrientations_MPI_1
106/132 Test  #35: PanzerDiscFE_output_stream_MPI_1 .................................   Passed    1.53 sec
107/132 Test  #51: PanzerDiscFE_tCloneLOF_MPI_2 .....................................   Passed    5.05 sec
        Start  10: PanzerDofMgr_tUniqueGlobalIndexerUtilities_MPI_2
        Start   1: PanzerCore_version_MPI_1
108/132 Test   #1: PanzerCore_version_MPI_1 .........................................   Passed    0.60 sec
109/132 Test  #12: PanzerDofMgr_tOrientations_MPI_1 .................................   Passed    1.21 sec
        Start  11: PanzerDofMgr_tBlockedDOFManager_MPI_2
110/132 Test  #49: PanzerDiscFE_tEpetraScatterDirichlet_MPI_4 .......................   Passed    1.32 sec
111/132 Test  #34: PanzerDiscFE_zero_sensitivities_MPI_1 ............................   Passed    1.31 sec
112/132 Test  #33: PanzerDiscFE_element_block_to_physics_block_map_MPI_1 ............   Passed    1.31 sec
        Start  14: PanzerDofMgr_tCartesianDOFMgr_DynRankView_MPI_4
        Start  13: PanzerDofMgr_tFilteredUGI_MPI_2
113/132 Test  #14: PanzerDofMgr_tCartesianDOFMgr_DynRankView_MPI_4 ..................   Passed    3.71 sec
        Start  20: PanzerDofMgr_tFieldAggPattern_DG_MPI_4
114/132 Test  #13: PanzerDofMgr_tFilteredUGI_MPI_2 ..................................   Passed    6.71 sec
        Start   2: PanzerCore_string_utilities_MPI_1
        Start   3: PanzerCore_hash_utilities_MPI_1
115/132 Test   #3: PanzerCore_hash_utilities_MPI_1 ..................................   Passed    0.60 sec
        Start   4: PanzerCore_memUtils_MPI_1
116/132 Test  #20: PanzerDofMgr_tFieldAggPattern_DG_MPI_4 ...........................   Passed    3.61 sec
117/132 Test   #2: PanzerCore_string_utilities_MPI_1 ................................   Passed    0.71 sec
118/132 Test  #11: PanzerDofMgr_tBlockedDOFManager_MPI_2 ............................   Passed    8.33 sec
119/132 Test   #4: PanzerCore_memUtils_MPI_1 ........................................   Passed    0.30 sec
        Start  16: PanzerDofMgr_tCartesianDOFMgr_DG_MPI_4
        Start  15: PanzerDofMgr_tCartesianDOFMgr_HighOrder_MPI_4
120/132 Test  #10: PanzerDofMgr_tUniqueGlobalIndexerUtilities_MPI_2 .................   Passed    9.64 sec
121/132 Test  #16: PanzerDofMgr_tCartesianDOFMgr_DG_MPI_4 ...........................   Passed    3.48 sec
        Start   6: PanzerDofMgr_tGeometricAggFieldPattern_MPI_4
122/132 Test   #6: PanzerDofMgr_tGeometricAggFieldPattern_MPI_4 .....................   Passed    1.30 sec
        Start   9: PanzerDofMgr_tFieldAggPattern_MPI_4
123/132 Test  #15: PanzerDofMgr_tCartesianDOFMgr_HighOrder_MPI_4 ....................   Passed    5.89 sec
        Start   7: PanzerDofMgr_tIntrepidFieldPattern_MPI_4
124/132 Test   #9: PanzerDofMgr_tFieldAggPattern_MPI_4 ..............................   Passed    1.21 sec
        Start   8: PanzerDofMgr_tNodalFieldPattern_MPI_4
125/132 Test   #7: PanzerDofMgr_tIntrepidFieldPattern_MPI_4 .........................   Passed    1.31 sec
        Start  19: PanzerDofMgr_tFieldAggPattern2_MPI_4
126/132 Test   #8: PanzerDofMgr_tNodalFieldPattern_MPI_4 ............................   Passed    0.83 sec
        Start  17: PanzerDofMgr_tGeometricAggFieldPattern2_MPI_4
127/132 Test  #50: PanzerDiscFE_LinearObjFactory_Tests_MPI_2 ........................   Passed   21.72 sec
        Start   5: PanzerDofMgr_tFieldPattern_MPI_4
128/132 Test  #19: PanzerDofMgr_tFieldAggPattern2_MPI_4 .............................   Passed    1.30 sec
        Start  18: PanzerDofMgr_tFieldPattern2_MPI_4
129/132 Test  #17: PanzerDofMgr_tGeometricAggFieldPattern2_MPI_4 ....................   Passed    1.17 sec
130/132 Test   #5: PanzerDofMgr_tFieldPattern_MPI_4 .................................   Passed    1.07 sec
131/132 Test  #18: PanzerDofMgr_tFieldPattern2_MPI_4 ................................   Passed    0.91 sec
132/132 Test  #21: PanzerDofMgr_scaling_test ........................................   Passed   37.91 sec

100% tests passed, 0 tests failed out of 132

Label Time Summary:
Panzer    = 5154.83 sec (132 tests)

Total Test time (real) = 1118.57 sec

@ibaned
Copy link
Contributor

ibaned commented Mar 20, 2018

@dsunder would be valuable to have in this conversation

@rppawlo
Contributor

rppawlo commented Mar 20, 2018

Here's the configure:

build_drekar.txt

@bartlettroscoe
Member Author

Shoot, it looks like the problem of multiple MPI jobs running at the same time and slowing each other down may be a problem with CUDA on GPUs as well, as described in #2446 (comment).

@nmhamster, we really need to figure out how to manage running multiple MPI jobs on the same nodes at the same time without having them step on each other.

@bartlettroscoe
Member Author

CC: @rppawlo, @ambrad, @nmhamster

As described in #2446 (comment), it seems that indeed the Trilinos test suite using Kokkos on the GPU does not allow the tests to be run in parallel either. I think this increases the importance of this story to get this fixed once and for all.

@mhoemmen
Contributor

@bartlettroscoe Do we not run the MPS server on the test machines, to let multiple MPI processes share the GPU?

@bartlettroscoe
Member Author

Do we not run the MPS server on the test machines, to let multiple MPI processes share the GPU?

@mhoemmen, not that I know of. There is no mention of an MPS server in any of the documentation that I can find in the files:

  • hansen:/opt/HANSEN_INTRO
  • white:/opt/WHITE_INTRO

I think this is really a question for the test beds team.

@nmhamster, do you know if the Test Bed team has any plans to set up an MPS server to manage this issue on any of the Test Bed machines?

@bartlettroscoe
Member Author

It looks like even non-threaded tests can't run in parallel with each other without slowing each other down, as was demonstrated for the ATDM gnu-opt-serial build in #2455 (comment). In that experiment, the test Anasazi_Epetra_BlockDavidson_auxtest_MPI_4 completed in 119 seconds when run alone but took 760 seconds to complete when run with ctest -j8 on 'hansen'.

We really need to start experimenting with the updated ctest program in 'master' that has the process affinity property.

@ibaned
Contributor

ibaned commented Mar 27, 2018

@bartlettroscoe is it possible to get a detailed description of what this new process affinity feature in CMake does?

@bartlettroscoe
Member Author

is it possible to get a detailed description of what this new process affinity feature in CMake does?

@ibaned,

We will need to talk with Brad King at Kitware. Otherwise, you can get more info by looking at:

(if you don't have access yet let me know and I can get you access).

@bartlettroscoe
Member Author

FYI: As pointed out by @etphipp in #2628 (comment), setting:

 -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none"

seems to fix the problem of OpenMP threads all binding to the same core on a RHEL6 machine.

Could this be a short-term solution to the problem of setting up automated builds of Trilinos with OpenMP enabled?
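If it helps, the workaround could be captured for automated builds in a small CMake cache pre-load fragment (a hypothetical file passed to cmake with -C); Trilinos_ENABLE_OpenMP and MPI_EXEC_PRE_NUMPROCS_FLAGS are the standard existing options, everything else here is just a sketch:

```cmake
# Hypothetical fragment openmp-bind-to-none.cmake, used as:
#   cmake -C openmp-bind-to-none.cmake [standard CI options] <trilinosDir>
set(Trilinos_ENABLE_OpenMP ON CACHE BOOL "Enable OpenMP in Trilinos")
set(MPI_EXEC_PRE_NUMPROCS_FLAGS "--bind-to;none" CACHE STRING
  "Keep OpenMP threads in each MPI rank from all binding to one core")
```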

@ibaned
Copy link
Contributor

ibaned commented Apr 25, 2018

@bartlettroscoe yes, that's a step in the right direction. Threads will at least be able to use all the cores, although they will move around and threads from different jobs will compete when using ctest -j. Still, you should get semi-decent results from this. I recommend dividing the argument to ctest -j by the number of threads per process. In fact, I think --bind-to none is the best way to go until we have direct support in CMake for binding with ctest -j.
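
For example (numbers purely illustrative), on a 16-core box with 2 OpenMP threads per MPI rank that bookkeeping would look something like:

# Divide the ctest parallel level by the threads per rank so the machine
# is not oversubscribed while the unbound threads migrate freely.
export OMP_NUM_THREADS=2
NUM_CORES=16                     # or: NUM_CORES=$(nproc)
ctest -j $(( NUM_CORES / OMP_NUM_THREADS ))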

@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Apr 25, 2018

I recommend dividing the argument to ctest -j by the number of threads per process. In fact I think --bind-to none is the best way to go until we have direct support in CMake for binding with ctest -j.

@ibaned, that is basically what we have been doing up until now in the ATDM Trilinos builds, and it is consistent with how we have set the CTest PROCESSORS property. The full scope of this Issue is to tell ctest about the total number of threads to be used and to use the updated version of CMake/CTest that can set process affinity correctly.

When I get some free time on my local RHEL6 machine, I will try enabling OpenMP, setting -D MPI_EXEC_PRE_NUMPROCS_FLAGS="--bind-to;none", and then running the entire test suite for PT packages in Trilinos for the GCC 4.8.4 and Intel 17.0.1 builds to see what that looks like.

@jwillenbring
Copy link
Member

@prwolfe We spoke today about updating the arguments for the GCC PR testing builds. When we do, and add OpenMP to one of them, we should use the argument described above.

@prwolfe
Copy link
Contributor

prwolfe commented Apr 25, 2018

Hmm, had not started OpenMP yet, but that would be good.

bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This was the agreement as part of trilinos#2317.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This was the agreement as part of trilinos#2317.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to bartlettroscoe/Trilinos that referenced this issue May 4, 2018
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462.

NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads
in different MPI ranks to the same cores.  See trilinos#2422.
bartlettroscoe added a commit to KyleFromKitware/Trilinos that referenced this issue May 20, 2020
…rilinos#2422)

This is using a special TriBITS-patched version of CMake 3.17.2.

This should spread things out a little better over the GPUs.
bartlettroscoe added a commit to KyleFromKitware/Trilinos that referenced this issue May 20, 2020
This will reduce the number of timeouts and seems to run almost as fast due to
problems with contention for the GPUs.
bartlettroscoe added a commit to KyleFromKitware/Trilinos that referenced this issue May 20, 2020
bartlettroscoe added a commit to KyleFromKitware/Trilinos that referenced this issue May 20, 2020
This also switches to patched CMake 3.17.2 which is needed to support this
feature.
bartlettroscoe added a commit that referenced this issue May 20, 2020
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue May 22, 2020
…s:develop' (cd0d4eb).

* trilinos-develop:
  Piro: cleaning/refactoring of steady adjoint sensitivities: moved computation of adjoint sensitivities from Piro::NOXSolver into Piro::SteadyStateSolver
  ATDM: Set several CUDA disables (trilinos#6329, trilinos#6799, trilinos#7090)
  Tempus: Add TimeEvent Object
  amesos2: fix tests, examples with basker, cleanup
  amesos2/basker: fix memory leak
  Phalanx: remove all use of cuda uvm
  ATDM: ride: Spread out work over GPUs (trilinos#2422)
  Kokkos: Switch to use Kokkos::Cuda().cuda_device() for expected_device (kokkos/kokkos#3040, trilinos#6840)
  Kokkos: Extract and use get_gpu() (kokkos/kokkos#3040, trilinos#6840)
  ATDM: Update documentation for updated 'waterman' env (trilinos#2422)
  ATDM: waterman: Reduce from ctest -j4 to -j2 (trilinos#2422)
  ATDM: waterman: Use cmake 3.17.2 and ctest resource limits for GPUs (trilinos#2422)
  Allow pointing to a tribits outside of Trilinos (trilinos#2422)
  Automatic snapshot commit from tribits at 39a9591
jmgate pushed a commit to tcad-charon/Trilinos that referenced this issue May 22, 2020
…s:develop' (cd0d4eb).

* trilinos-develop:
  Piro: cleaning/refactoring of steady adjoint sensitivities: moved computation of adjoint sensitivities from Piro::NOXSolver into Piro::SteadyStateSolver
  ATDM: Set several CUDA disables (trilinos#6329, trilinos#6799, trilinos#7090)
  Tempus: Add TimeEvent Object
  amesos2: fix tests, examples with basker, cleanup
  amesos2/basker: fix memory leak
  Phalanx: remove all use of cuda uvm
  ATDM: ride: Spread out work over GPUs (trilinos#2422)
  Kokkos: Switch to use Kokkos::Cuda().cuda_device() for expected_device (kokkos/kokkos#3040, trilinos#6840)
  Kokkos: Extract and use get_gpu() (kokkos/kokkos#3040, trilinos#6840)
  ATDM: Update documentation for updated 'waterman' env (trilinos#2422)
  ATDM: waterman: Reduce from ctest -j4 to -j2 (trilinos#2422)
  ATDM: waterman: Use cmake 3.17.2 and ctest resource limits for GPUs (trilinos#2422)
  Allow pointing to a tribits outside of Trilinos (trilinos#2422)
  Automatic snapshot commit from tribits at 39a9591
@bartlettroscoe
Copy link
Member Author

CC: @KyleFromKitware

@jjellio, continuing from the discussion started in kokkos/kokkos#3040, I timed the Trilinos test suite with a CUDA build on 'vortex' for the 'ats2' env and found that raw 'jsrun' does not automatically spread work out over the 4 GPUs on a node on that system. However, when I switched over to the new CTest GPU allocation approach in commit 692e990 as part of PR #7427, I got perfect scalability of the TpetraCore_gemm tests up to ctest -j4. See the details in PR #7427. I also repeat the timing experiments from that PR branch below.

Details: (click to expand)

A) Running some timing experiments with the TpetraCore_gemm tests without ctest GPU allocation (just raw 'jsrun' behavior on one node):

$ bsub -W 6:00 -Is bash
Job <196758> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on vortex59>>
Don't load the sems modules on 'vortex'!

$ cd /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra/

$ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt
Hostname 'vortex59' matches known ATDM host 'vortex59' and system 'ats2'
Setting compiler and build options for build-name 'Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt'
Using ats2 compiler stack CUDA-10.1.243_GNU-7.3.1_SPMPI-ROLLING to build RELEASE code with Kokkos node type CUDA

$ for n in 1 2 4 8 ; do echo ; echo ; echo "time ctest -j $n -R TpetraCore_gemm" ; time ctest -j $n -R TpetraCore_gemm ; done

time ctest -j 1 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
1/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed   17.93 sec
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
2/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed   17.63 sec
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
3/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed   17.69 sec
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
4/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed   17.56 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    =  70.81 sec*proc (4 tests)

Total Test time (real) =  70.91 sec

real    1m10.921s
user    0m0.854s
sys     0m0.325s


time ctest -j 2 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
1/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed   91.98 sec
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
2/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed   91.98 sec
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
3/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed   91.53 sec
4/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed   91.58 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    = 367.07 sec*proc (4 tests)

Total Test time (real) = 183.66 sec

real    3m3.669s
user    0m0.750s
sys     0m0.461s


time ctest -j 4 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
1/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed  196.77 sec
2/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed  196.88 sec
3/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed  196.92 sec
4/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed  197.04 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    = 787.60 sec*proc (4 tests)

Total Test time (real) = 197.15 sec

real    3m17.154s
user    0m0.706s
sys     0m0.633s


time ctest -j 8 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
1/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed  196.97 sec
2/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed  196.98 sec
3/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed  196.98 sec
4/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed  197.08 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    = 788.01 sec*proc (4 tests)

Total Test time (real) = 197.19 sec

real    3m17.195s
user    0m0.738s
sys     0m0.555s

Wow, that is terrible anti-speedup.


B) Now to test running the TpetraCore_gemm_ tests again with CTest GPU allocation approach with different ctest -j<N> levels:

$ bsub -W 6:00 -Is bash
Job <199125> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on vortex59>>
Don't load the sems modules on 'vortex'!

$ cd /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra/

$ . ~/Trilinos.base/Trilinos/cmake/std/atdm/load-env.sh Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt
Hostname 'vortex59' matches known ATDM host 'vortex59' and system 'ats2'
Setting compiler and build options for build-name 'Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt'
Using ats2 compiler stack CUDA-10.1.243_GNU-7.3.1_SPMPI-ROLLING to build RELEASE code with Kokkos node type CUDA

$ for n in 1 2 4 8 ; do echo ; echo ; echo "time ctest -j $n -R TpetraCore_gemm" ; time ctest -j $n -R TpetraCore_gemm ; done


time ctest -j 1 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
1/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed   12.30 sec
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
2/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed   11.92 sec
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
3/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed   12.36 sec
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
4/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed   11.99 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    =  48.57 sec*proc (4 tests)

Total Test time (real) =  48.75 sec

real    0m48.904s
user    0m0.807s
sys     0m0.362s


time ctest -j 2 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
1/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed   12.75 sec
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
2/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed   12.82 sec
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
3/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed   12.65 sec
4/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed   12.90 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    =  51.13 sec*proc (4 tests)

Total Test time (real) =  25.75 sec

real    0m25.763s
user    0m0.762s
sys     0m0.445s


time ctest -j 4 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
1/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed   13.32 sec
2/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed   13.32 sec
3/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed   13.87 sec
4/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed   13.93 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    =  54.44 sec*proc (4 tests)

Total Test time (real) =  14.03 sec

real    0m14.036s
user    0m0.853s
sys     0m0.452s


time ctest -j 8 -R TpetraCore_gemm
Test project /vscratch1/rabartl/Trilinos.base/BUILDS/VORTEX/CTEST_S/Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt/SRC_AND_BUILD/BUILD/packages/tpetra
    Start 26: TpetraCore_gemm_m_eq_13_MPI_1
    Start 23: TpetraCore_gemm_m_eq_1_MPI_1
    Start 24: TpetraCore_gemm_m_eq_2_MPI_1
    Start 25: TpetraCore_gemm_m_eq_5_MPI_1
1/4 Test #23: TpetraCore_gemm_m_eq_1_MPI_1 .....   Passed   13.19 sec
2/4 Test #26: TpetraCore_gemm_m_eq_13_MPI_1 ....   Passed   13.26 sec
3/4 Test #25: TpetraCore_gemm_m_eq_5_MPI_1 .....   Passed   13.91 sec
4/4 Test #24: TpetraCore_gemm_m_eq_2_MPI_1 .....   Passed   14.60 sec

100% tests passed, 0 tests failed out of 4

Label Time Summary:
Tpetra    =  54.96 sec*proc (4 tests)

Total Test time (real) =  14.70 sec


real    0m14.710s
user    0m0.753s
sys     0m0.542s

Okay, so going from ctest -j4 to ctest -j8 is no different for these tests because there are just 4 of them. But it does prove that the ctest GPU allocation algorithm does breadth-first allocation across the GPUs.
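
For anyone trying to reproduce this outside of the ATDM scripts, here is a minimal sketch of the CTest resource-allocation setup (the file name and slot counts are illustrative; the tests themselves also need the RESOURCE_GROUPS property set, plus something on the test side that maps the CTEST_RESOURCE_GROUP_* environment variables ctest sets onto an actual GPU choice, which the ATDM builds referenced above wire up through TriBITS):

# Describe the 4 GPUs on a node to CTest using the resource spec file
# format from the CMake documentation.
cat > gpu-resources.json <<'EOF'
{
  "version": { "major": 1, "minor": 0 },
  "local": [
    { "gpus": [
        { "id": "0", "slots": 1 },
        { "id": "1", "slots": 1 },
        { "id": "2", "slots": 1 },
        { "id": "3", "slots": 1 }
    ] }
  ]
}
EOF

# With the resource spec in play, ctest -j4 spreads concurrent tests
# across the 4 GPUs instead of piling them all onto GPU 0.
ctest --resource-spec-file gpu-resources.json -j4 -R TpetraCore_gemm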

@jjellio
Copy link
Contributor

jjellio commented May 26, 2020

@bartlettroscoe It depends entirely on the flags you've given to JSRUN. The issue I've linked to shows it working. It hinges on resource sets. What jsrun lines are you using?

@bartlettroscoe
Copy link
Member Author

What jsrun lines are you using?

@jjellio, I believe the same ones being used by SPARC that these were copied from. See lines starting at:

Since the CTest GPU allocation method is working so well, I would be hesitant to change what is currently in PR #7204

@jjellio
Copy link
Contributor

jjellio commented May 26, 2020

Yep, and those options do not specify a GPU or binding options. The lines currently used on most platforms for Trilinos testing are chosen to oversubscribe a system to get throughput.

The flags I used back then were:

jsrun -r1 -a1 -c4 -g1 -brs
-r1 = 1 resource set
-a1 = 1 task per resource set. (so 1 resource set ... I get 1 task total)
-c4 = 4 cores per task
-g1 = 1 GPU per task
-brs = bind to resource set (so you get a process mask that isolates resource sets)

The problem is that those flags use -a, which forbids -p. It could be that -a is what made the difference, but I tend to think it was -g1 - Spectrum needs to know you want a GPU.

The flags I'd use are:
export ATDM_CONFIG_MPI_POST_FLAGS="-r;4,-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"

@bartlettroscoe
Copy link
Member Author

@jjellio,

When you say:

The lines currently used on most platforms for Trilinos testing are chosen to oversubscribe a system to get throughput.

who is doing this testing?

Otherwise, we have had problems with robustness when trying to oversubscribe on some systems (I would have to research which ones).

@jjellio
Copy link
Contributor

jjellio commented May 26, 2020

So, I just ran on the ATS2 testbed (rzansel)

Using -r4 -c4 -g1 -brs -p1, the jobs are serialized (look for 'PID XYZ has started!'; if they ran in parallel, you would expect 4 PIDs to start right away):

jjellio@rzansel46:async]$ for i in $(seq 1 20); do jsrun -r4 -c4 -g1 -brs -p1 ./runner.sh & done
[1] 92455
[2] 92456
[3] 92457
[4] 92458
[5] 92459
[jjellio@rzansel46:async]$ PID 92758 has started!
Rank: 00 Local Rank: 0 rzansel46
  Cuda Devices: 0
  CPU list:  8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
  Sockets:  0   NUMA list:  0 
  Job will sleep 26 seconds to waste time
Job jsrun started at Tue May 26 11:23:36 PDT 2020
            ended at Tue May 26 11:24:02 PDT 2020

PID 92853 has started!
Rank: 00 Local Rank: 0 rzansel46
  Cuda Devices: 0
  CPU list:  8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
  Sockets:  0   NUMA list:  0 
  Job will sleep 20 seconds to waste time
Job jsrun started at Tue May 26 11:24:02 PDT 2020
            ended at Tue May 26 11:24:22 PDT 2020

PID 92925 has started!
Rank: 00 Local Rank: 0 rzansel46
  Cuda Devices: 0
  CPU list:  8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
  Sockets:  0   NUMA list:  0 
  Job will sleep 17 seconds to waste time
Job jsrun started at Tue May 26 11:24:22 PDT 2020
            ended at Tue May 26 11:24:39 PDT 2020

PID 92963 has started!
Rank: 00 Local Rank: 0 rzansel46
  Cuda Devices: 0
  CPU list:  8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
  Sockets:  0   NUMA list:  0 
  Job will sleep 30 seconds to waste time
Job jsrun started at Tue May 26 11:24:39 PDT 2020
            ended at Tue May 26 11:25:09 PDT 2020

PID 93045 has started!

They become unserialized if you use -r1 -a1 -g1 -brs; that's pretty obnoxious. That line works in place of -p1.

So it would seem if you use:

export ATDM_CONFIG_MPI_POST_FLAGS="-r;4,-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"

Plus the Kitware/CTest stuff, it should work fine. My only skin in this game is avoiding more headaches on ATS2... I don't need any more headaches on ATS2.

@bartlettroscoe
Copy link
Member Author

So it would seem if you use:

export ATDM_CONFIG_MPI_POST_FLAGS="-r;4,-c;4;-g;1;-brs"
export ATDM_CONFIG_MPI_EXEC_NUMPROCS_FLAG="-p"

Plus the Kitware/Ctest stuff it should work fine.

What is the advantage of selecting those options over what is listed in atdm/ats2/environment.sh currently? What is broken that this is trying to fix?

My only skin in this game is more headaches on ATS2... I don't need anymore headaches on ATS2.

It is not just you and me; it is everyone running Trilinos tests on ATS-2. The ATDM Trilinos configuration should be the recommended way for people to run the Trilinos test suite on that system.

@jjellio
Copy link
Contributor

jjellio commented May 26, 2020

If you don't have a -g1 flag, then jsrun is not going to set CUDA_VISIBLE_DEVICES.
If Kitware is going to manage the CUDA device assignment, then it shouldn't matter.

-c4 -brs is process binding. It keeps your job on a given set of cores and prevents other jobs from landing there. I recommend anything in the range 2 to 10; I've just found 4 to be a good number. If you don't have -cN, you get exactly 1 hardware thread running your job, and that isn't enough to keep both the GPU and the host code happy. You need -c2 or more. Specifying -brs makes -c operate on logical cores (not hardware threads), so -c2 -brs gives you substantially more compute resources. You need a few hardware threads to keep the Nvidia threads happy (they spawn 4).

TLDR: just use -r4 -c4 -g1 -brs.

As for oversubscription stuff:

How does the ctest work interact with using KOKKOS_NUM_DEVICES or --kokkos-ndevices?

With jsrun, it sets CUDA_VISIBLE_DEVICES, which makes --kokkos-ndevices always see device zero.
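
For what it's worth, a quick sanity check along these lines (same flags as above; the inline echo is just illustrative) is:

# With -g1, jsrun sets CUDA_VISIBLE_DEVICES for the job; drop the -g1
# and the variable should come back unset.
jsrun -r4 -c4 -g1 -brs -p1 bash -c \
  'echo "host=$(hostname)  CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"'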

@jjellio
Copy link
Contributor

jjellio commented May 26, 2020

FYI, I had the same comment here:
#6724 (comment)

trilinos-autotester added a commit that referenced this issue May 28, 2020
…ts2-refactor

Automatically Merged using Trilinos Pull Request AutoTester
PR Title: ATDM: Address several 'ats2' issues (#7402, #7406, #7122, #2422)
PR Author: bartlettroscoe
@github-actions
Copy link

github-actions bot commented Jun 5, 2021

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 5, 2021
@bartlettroscoe bartlettroscoe removed the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 5, 2021
@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Jun 5, 2021

This is actually really close to getting done. We just need a tribits_add[_advanced]_test( ... NUM_THREADS_PER_PROC <numThreadsPerProc> ... ) argument for OpenMP builds and then I think we have it. We have CUDA builds well covered now (at least for single-node testing).

@github-actions
Copy link

github-actions bot commented Jun 8, 2022

This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
If you would like to keep this issue open please add a comment and/or remove the MARKED_FOR_CLOSURE label.
If this issue should be kept open even with no activity beyond the time limits you can add the label DO_NOT_AUTOCLOSE.
If it is ok for this issue to be closed, feel free to go ahead and close it. Please do not add any comments or change any labels or otherwise touch this issue unless your intention is to reset the inactivity counter for an additional year.

@github-actions github-actions bot added the MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. label Jun 8, 2022
@bartlettroscoe bartlettroscoe added DO_NOT_AUTOCLOSE This issue should be exempt from auto-closing by the GitHub Actions bot. and removed MARKED_FOR_CLOSURE Issue or PR is marked for auto-closure by the GitHub Actions bot. labels Jun 8, 2022
@bartlettroscoe
Copy link
Member Author

bartlettroscoe commented Jun 8, 2022

While this has been partially addressed with the CMake Resource management and GPU limiting, the full scope of this Story has not been addressed yet (see above).
