Define better strategy for managing threaded testing with Trilinos #2422
The creation of this issue came out of a conversation I had with @nmhamster last week about the issues with threaded testing of Trilinos and its impact on ATDM, including the issue of testing with the new ctest property.
Below is some detailed info from @rppawlo about how to reproduce the binding of threads in multiple MPI processes to the same core. Can you attach your complete
Below are the panzer test results for the corresponding configure file that I sent earlier. I ran with `-E ConvTest` to turn off the costly convergence tests. HWLOC was enabled for an OpenMP Kokkos build and MPI was configured with:
I exported Without specifying the
Running with `ctest -E ConvTest` results:
Running with `ctest -E ConvTest -j16` results:
@dsunder would be valuable to have in this conversation.
Here's the configure:
Shoot, it looks like the problem of multiple MPI jobs running at the same time and slowing each other down may also be a problem with CUDA on GPUs, as described in #2446 (comment). @nmhamster, we really need to figure out how to manage running multiple MPI jobs on the same nodes at the same time without having them step on each other.
CC: @rppawlo, @ambrad, @nmhamster. As described in #2446 (comment), it seems that the Trilinos test suite using Kokkos on the GPU does not allow the tests to be run in parallel either. I think this increases the importance of this story to get this fixed once and for all.
@bartlettroscoe Do we not run the MPS server on the test machines, to let multiple MPI processes share the GPU?
@mhoemmen, not that I know of. There is no mention of an MPS server in any of the documentation that I can find in the files:
I think this is really a question for the Test Bed team. @nmhamster, do you know if the Test Bed team has any plans to set up an MPS server to manage this issue on any of the Test Bed machines?
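(For reference, enabling MPS on a node generally just means starting the control daemon before the test jobs run; the sketch below is illustrative only, with placeholder directories, and says nothing about how the Test Bed machines are actually configured.)

```bash
# Illustrative sketch only -- directories are placeholders, not Test Bed settings.
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log
nvidia-cuda-mps-control -d           # start the MPS control daemon
# ... run the MPI test jobs that share the GPU ...
echo quit | nvidia-cuda-mps-control  # shut the daemon down afterwards
```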
It looks like even non-threaded tests can't run in parallel with each other without slowing each other down, as was demonstrated for the ATDM builds. We really need to start experimenting with the updated ctest program in 'master' that has the process affinity property.
@bartlettroscoe is it possible to get a detailed description of what this new process affinity feature in CMake does?
We will need to talk with Brad King at Kitware. Otherwise, you can get more info by looking at: (if you don't have access yet, let me know and I can get you access).
FYI: As pointed out by @etphipp in #2628 (comment), the setting described there seems to fix the problem of OpenMP threads all binding to the same core on a RHEL6 machine. Could this be a short-term solution to the problem of setting up automated builds of Trilinos with OpenMP enabled?
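(The exact setting is not captured above; purely as an illustration, and assuming the referenced comment is about disabling OpenMP's thread binding, that is normally expressed as follows.)

```bash
# Illustration only -- assumed to correspond to the setting referenced above:
# turn off OpenMP thread-to-core binding so threads from different test
# processes are not all pinned to the same core.
export OMP_PROC_BIND=false
```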
@bartlettroscoe yes, that's a step in the right direction. Threads will at least be able to use all the cores, although they will move around, and threads from different jobs will compete if using
@ibaned, that is basically what we have been doing up until now in the ATDM Trilinos builds, and that is consistent with how we have set up CTest. When I get some free time on my local RHEL6 machine, I will try enabling OpenMP and setting
@prwolfe We spoke today about updating the arguments for the GCC PR testing builds. When we do, and add OpenMP to one of them, we should use the argument described above.
Hmm, we had not started OpenMP yet, but that would be good.
This was the agreement as part of trilinos#2317. NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads in different MPI ranks to the same cores. See trilinos#2422.
This build matches settings targeted for the GCC 4.8.4 auto PR build in trilinos#2462. NOTE: This is using 'mpiexec --bind-to none ...' to avoid pinning the threads in different MPI ranks to the same cores. See trilinos#2422.
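(For context, a hedged sketch of how such a flag is typically injected into the TriBITS-generated test commands at configure time; whether the ATDM scripts use exactly this mechanism, and these exact cache variables, is an assumption here.)

```bash
# Sketch only: make every mpiexec invocation that ctest runs use '--bind-to none'
# so OpenMP threads in different MPI ranks are not pinned to the same cores.
cmake \
  -D TPL_ENABLE_MPI=ON \
  -D MPI_EXEC=mpiexec \
  -D MPI_EXEC_POST_NUMPROCS_FLAGS="--bind-to;none" \
  ../Trilinos
```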
… (trilinos#2422) This is using a special TriBITS-patched version of CMake 3.17.2. This should spread things out a little better over the GPUs.
This will reduce the number of timeouts and seems to run almost as fast, given the problems with contention for the GPUs.
This also switches to patched CMake 3.17.2 which is needed to support this feature.
…s:develop' (cd0d4eb).

* trilinos-develop:
  Piro: cleaning/refactoring of steady adjoint sensitivities
  moved computation of adjoint sensitivities from Piro::NOXSolver into Piro::SteadyStateSolver
  ATDM: Set several CUDA disables (trilinos#6329, trilinos#6799, trilinos#7090)
  Tempus: Add TimeEvent Object
  amesos2: fix tests, examples with basker, cleanup
  amesos2/basker: fix memory leak
  Phalanx: remove all use of cuda uvm
  ATDM: ride: Spread out work over GPUs (trilinos#2422)
  Kokkos: Switch to use Kokkos::Cuda().cuda_device() for expected_device (kokkos/kokkos#3040, trilinos#6840)
  Kokkos: Extract and use get_gpu() (kokkos/kokkos#3040, trilinos#6840)
  ATDM: Update documentation for updated 'waterman' env (trilinos#2422)
  ATDM: waterman: Reduce from ctest -j4 to -j2 (trilinos#2422)
  ATDM: waterman: Use cmake 3.17.2 and ctest resource limits for GPUs (trilinos#2422)
  Allow pointing to a tribits outside of Trilinos (trilinos#2422)
  Automatic snapshot commit from tribits at 39a9591
CC: @KyleFromKitware. @jjellio, continuing from the discussion started in kokkos/kokkos#3040, I did timing of the Trilinos test suite with a CUDA build on 'vortex' for the 'ats2' env, and I found that raw 'jsrun' does not spread the work out over the 4 GPUs on a node on that system automatically. However, when I switched over to the new CTest GPU allocation approach in commit 692e990 as part of PR #7427, I got perfect scalability of the TpetraCore_gemm tests up to

Details:

A) Running some timing experiments with the TpetraCore_gemm tests without ctest GPU allocation (just raw 'jsrun' behavior on one node):
Wow, that is terrible anti-speedup.

B) Now to test running the TpetraCore_gemm_ tests again with the CTest GPU allocation approach with different
Okay, so going from
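(For readers not familiar with the CTest GPU allocation feature referenced above: it works by pointing ctest at a resource-spec file describing the node and giving each test a RESOURCE_GROUPS property. A minimal sketch with made-up file contents follows; it is not the actual Trilinos/ATDM setup.)

```bash
# Sketch only: describe the node's 4 GPUs to ctest (requires CMake/CTest >= 3.16).
cat > resource-spec.json <<'EOF'
{
  "version": {"major": 1, "minor": 0},
  "local": [
    { "gpus": [ {"id": "0"}, {"id": "1"}, {"id": "2"}, {"id": "3"} ] }
  ]
}
EOF
# Tests that declare a RESOURCE_GROUPS property (e.g. "1,gpus:1") then get
# spread across the GPUs instead of all landing on GPU 0.
ctest --resource-spec-file resource-spec.json -j8
```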
@bartlettroscoe It depends entirely on the flags you've given to jsrun. The issue I've linked to shows it working. It hinges on resource sets. What jsrun lines are you using?
@jjellio, I believe they are the same ones being used by SPARC that these were copied from. See the lines starting at:
Since the CTest GPU allocation method is working so well, I would be hesitant to change what is currently in PR #7204.
Yep, and those options do not specify GPU or binding options. The lines currently used on most platforms for Trilinos testing are chosen to oversubscribe a system to get throughput. The flags I used back then were:
The problem is that those flags use

The flags I'd use are:
When you say:
who is doing this testing? Otherwise, we have had problems with robustness when trying to oversubscribe on some systems (I would have to resource some).
So, I just ran on the ATS2 testbed (rzansel). Using
They become unserialized if you use

So it would seem that if you use:

`export ATDM_CONFIG_MPI_POST_FLAGS="-r;4,-c;4;-g;1;-brs"`

plus the Kitware/CTest stuff, it should work fine. My only skin in this game is more headaches on ATS2... I don't need any more headaches on ATS2.
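(Reading those ATDM flags as ordinary jsrun options, a single test launch would look roughly like the sketch below; the rank count and executable name are placeholders, and the exact expansion the ATDM scripts perform is an assumption here.)

```bash
# Sketch only: 4 resource sets per host, 4 cores and 1 GPU per resource set,
# with each rank bound to its resource set ('-brs' is shorthand for '--bind rs').
jsrun -p 4 -r 4 -c 4 -g 1 -brs ./SomeTrilinosTest.exe
```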
What is the advantage of selecting those options over what is listed in
It is not just you and me, it is everyone running Trilinos tests on ATS-2. The ATDM Trilinos configuration should be the recommended way for people to run the Trilinos test suite on that system.
If you don't have a
TLDR: just use

As for the oversubscription stuff: how does the ctest work interact with using KOKKOS_NUM_DEVICES or --kokkos-ndevices? With jsrun, it sets CUDA_VISIBLE_DEVICES, which makes kokkos-ndevices always see a zero device.
FYI, I had the same comment here:
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
This is actually really close to getting done. We just need a
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity. |
While this has been partially addressed with the CMake resource management and GPU limiting, the full scope of this Story has not been addressed yet (see above).
CC: @trilinos/framework, @trilinos/kokkos, @trilinos/tpetra, @nmhamster, @rppawlo
Next Action Status
Set up a meeting to discuss current status of threaded testing in Trilinos and some steps to try to address the issues ...
Description
It seems the testing strategy in Trilinos for testing with threading is to build threaded code, run all of the threaded tests with the same number of threads (such as by setting `export OMP_NUM_THREADS=2` when using OpenMP), and then run the test suite with `ctest -j<N>` with that fixed number of threads. But this approach, and testing with threads enabled in general, has some issues.

First, with some configurations and systems, running with any `<N>` with `ctest -j<N>` will result in all of the test executables binding to the same threads on the same cores, making things run very slowly, as described in #2398 (comment). A similar problem also occurs with CUDA builds, where the various test processes running concurrently do not spread the load across the available GPUs (see #2446 (comment)).

Second, even when one does not experience the above problem of binding to the same cores (which is not always a problem), this approach does not make very good use of the test machine, because it assumes that every MPI process is multi-threaded with Kokkos, which is not true. Even when `OMP_NUM_THREADS > 1`, there are a lot of Trilinos tests that don't have any threaded code, so even if ctest allocates room for 2 threads per MPI process, only one thread will be used. This results in many cores sitting idle and therefore in tests taking longer to complete.

The impact of the two problems above is that many developers and many automated builds have to run with a small `ctest -j<N>` (e.g., `ctest -j8` is used on many of the ATDM Trilinos builds) and therefore do not utilize many of the cores that are available. This makes the time to run the full test suite go up significantly. That negatively impacts developer productivity (because developers have to wait longer to get feedback from running tests locally), wastes existing testing hardware, and/or limits the number of builds and tests that can be run in a given testing day (which reduces the number of defects that we can catch and therefore costs Trilinos developers and users time and money).

Third, having to run the entire Trilinos test suite with a fixed number of threads, like `export OMP_NUM_THREADS=2` or `export OMP_NUM_THREADS=4`, either does not result in very good testing or results in very expensive testing by having to run the entire test suite multiple times. It has been observed that defects show up only with certain thread counts, such as `export OMP_NUM_THREADS=5`. This would be like having every MPI test in Trilinos run with exactly the same number of MPI processes, which would not result in very good testing (and is not the case in Trilinos, as several tests are run with different numbers of MPI processes).
Ideas for Possible Solutions

First, ctest needs to be extended to inform it of the architecture of the system where it will be running tests. CTest needs to know the number of sockets per node, the number of cores per socket, the number of threads per node, and the number of nodes. We will also need to inform CTest about the number of MPI ranks vs. threads per MPI rank for each test (i.e., add a `THREADS_PER_PROCESS` property in addition to the `PROCESSORS` property). With that type of information, ctest should be able to determine the binding of the different ranks in an MPI job that runs a test to specific cores on sockets on nodes. We will also need to find a way to communicate this information to the MPI jobs when they are run by ctest. I think this means adding the kinds of process affinity and process placement support that you see in modern MPI implementations (see https://github.com/open-mpi/ompi/wiki/ProcessAffinity and https://github.com/open-mpi/ompi/wiki/ProcessPlacement). See this Kitware backlog item.

Second, we should investigate how to add a `NUM_THREADS_PER_PROC <numThreadsPerProc>` argument to the TRIBITS_ADD_TEST() and TRIBITS_ADD_ADVANCED_TEST() commands. It would be good if this could be added directly to these TriBITS functions, with some type of "plugin" system to allow us to define how the number of threads gets set when running the individual test. But the default TriBITS implementation could just compute `NUM_TOTAL_CORES_USED <numTotalCoresUsed>` from `<numThreadsPerProc> * <numMpiProcs>`.

The specialization of this new TriBITS functionality for Kokkos/Trilinos would set the number of requested threads based on the enabled threading model known at configure time. For OpenMP, it would set the env var `OMP_NUM_THREADS=<numThreads>`, and for other threading models it would pass in `--kokkos-threads=<numThreads>`. If the `NUM_THREADS_PER_PROC <numThreadsPerProc>` argument were missing, this could use a default number of threads (e.g., a global configure argument `Trilinos_DEFAULT_NUM_THREADS` with default `1`). If the computed `<numTotalCoresUsed>` were larger than `${MPI_EXEC_MAX_NUMPROCS}` (which should be set to the maximum number of threads that can be run on that machine when using threading), then the test would get excluded and a message would be printed to STDOUT. Such a CMake function should be pretty easy to write if you know the threading model used by Kokkos at configure time; a rough sketch of the idea is shown below.
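(A minimal CMake sketch of this idea, written against plain add_test()/set_tests_properties() rather than the real TriBITS functions; every name here is illustrative, not an actual TriBITS implementation.)

```cmake
# Sketch only: how a NUM_THREADS_PER_PROC argument could translate into test
# properties and environment settings. Function and argument names are
# illustrative; a real TriBITS implementation would differ.
function(add_threaded_test TEST_NAME EXE NUM_MPI_PROCS NUM_THREADS_PER_PROC)
  math(EXPR total_cores "${NUM_MPI_PROCS} * ${NUM_THREADS_PER_PROC}")
  if(total_cores GREATER "${MPI_EXEC_MAX_NUMPROCS}")
    message(STATUS "Excluding ${TEST_NAME}: it needs ${total_cores} cores")
    return()
  endif()
  add_test(NAME ${TEST_NAME}
    COMMAND ${MPI_EXEC} ${MPI_EXEC_NUMPROCS_FLAG} ${NUM_MPI_PROCS} ${EXE})
  set_tests_properties(${TEST_NAME} PROPERTIES
    # Let 'ctest -j<N>' account for every core the test will actually use ...
    PROCESSORS ${total_cores}
    # ... and tell the test itself how many OpenMP threads to spin up.
    ENVIRONMENT "OMP_NUM_THREADS=${NUM_THREADS_PER_PROC}")
endfunction()
```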
Definition of Done:

Tasks: