KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 failing in ATDM Trilinos builds starting before 2020-07-08 #8544

Closed
e10harvey opened this issue Jan 6, 2021 · 10 comments
Labels

  • ATDM Sev: Blocker - Problems that make Trilinos unfit to be adopted by one or more ATDM APPs
  • client: ATDM - Any issue primarily impacting the ATDM project
  • impacting: tests - The defect (bug) is primarily a test failure (vs. a build failure)
  • PA: Data Services - Issues that fall under the Trilinos Data Services Product Area
  • pkg: Kokkos
  • type: bug - The primary issue is a bug in Trilinos code or tests

Comments

@e10harvey
Contributor

CC: @trilinos/kokkos, @crtrott (Trilinos Data Services Product Lead), @bartlettroscoe

Next Action Status

Description

As shown in this query (click "Show Matching Output" in the upper right), the test:

  • KokkosCore_UnitTest_CudaInterOpStreams_MPI_1

in the builds:

  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt
  • Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release
  • Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug

started failing on testing day 2020-07-08.

All of the tests in debug builds show output like the following (as shown here):

Kokkos::Cuda ERROR: Failed to call Kokkos::Cuda::finalize()
unknown file: Failure
C++ exception with description "cudaGetLastError() error( cudaErrorInvalidResourceHandle): invalid resource handle
<snip>

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

One should be able to reproduce this failure as described in:

and the system-specific instructions at:

Just log into any of the associated machines and copy and paste the full CDash build name <build-name> listed above and run commands like:

$ cd <some_build_dir>/

$ source $TRILINOS_DIR/cmake/std/atdm/load-env.sh <build-name>

$ cmake \
 -GNinja \
 -DTrilinos_CONFIGURE_OPTIONS_FILE:STRING=cmake/std/atdm/ATDMDevEnv.cmake \
 -DTrilinos_ENABLE_TESTS=ON -DTrilinos_ENABLE_<package-name>=ON \
 $TRILINOS_DIR

$ make NP=16

$ <command-to-run-on-compute-node> ctest -j4

where <package-name> is any package that you want to enable to reproduce build and/or test results.

Again, for exact system-specific details on what commands to run to build and run tests, see:

If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.

@grover-trilinos

Test results for issue #8544 as of 2021-01-10

Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=4

Detailed test results: (click to expand)

Tests with issue trackers Passed: twip=6

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 14 | 10 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 14 | 9 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 12 | 12 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 10 | 14 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 12 | 12 | #8544 |

Tests with issue trackers Failed: twif=4

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 9 | 14 | 10 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 12 | 12 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 2 | 11 | 13 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 8 | 15 | #8544 |

This is an automated comment generated by Grover. Each week, Grover collates and reports data from CDash in an automated way to make it easier for developers to stay on top of their issues. Grover saw that there are tests being tracked on CDash that are associated with this open issue. If you have a question, please reach out to Ross. I'm just a cat.

@grover-trilinos

Test results for issue #8544 as of 2021-01-17

Tests with issue trackers Passed: twip=6
Tests with issue trackers Failed: twif=4

Detailed test results: (click to expand)

Tests with issue trackers Passed: twip=6

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 2 | 13 | 11 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 14 | 9 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 2 | 11 | 13 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 1 | 10 | 14 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 11 | 13 | #8544 |

Tests with issue trackers Failed: twif=4

| Site | Build Name | Test Name | Status | Details | Consecutive Non-pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 4 | 12 | 12 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 10 | 13 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 2 | 12 | 10 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Failed | Completed (Failed) | 1 | 10 | 12 | #8544 |


@crtrott
Member

crtrott commented Jan 22, 2021

This test, as well as the one mentioned in #8543, tests interoperability with raw CUDA. In particular, they test situations where CUDA is already in use before Kokkos initialize and/or after Kokkos finalize. As such, switching the GPU ID during Kokkos initialize will lead to the observed errors. One should NOT use any mechanism to tell Kokkos to choose a specific GPU. CUDA_VISIBLE_DEVICES probably works. In practice, telling Kokkos to use device id 0 will also work (I am just not sure that CUDA guarantees that device 0 is the default GPU).

@bartlettroscoe
Member

> One should NOT use any mechanism to tell Kokkos to choose a specific GPU. CUDA_VISIBLE_DEVICES probably works. In practice telling Kokkos to use device id 0 will also work (just not sure that CUDA guarantees that that is the default GPU).

@crtrott, that was not the approach/agreement we came to as part of:

Perhaps Kokkos needs to be updated to read in these CTest env vars earlier?

Changing to use CUDA_VISIBLE_DEVICES would require writing an intermediate wrapper in TriBITS for every test that reads in the CTest-set env vars and sets CUDA_VISIBLE_DEVICES accordingly. The design we came up with was for CTest to not have to know about GPUs in particular, and for us to not have to modify TriBITS to coordinate the communication between CTest and Kokkos. But, again, we can extend TriBITS to do the needed translations (and perhaps we should), but that just adds more control and complexity to TriBITS and makes it a thicker wrapper of CMake/CTest.
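
To make the cost of that option concrete, here is a minimal sketch of what such an intermediate wrapper might look like. It assumes CTest's resource allocation exports `CTEST_RESOURCE_GROUP_0_GPUS` in the form `id:<n>,slots:<m>` (entries separated by `;`), that the resource type is named `gpus`, and that only resource group 0 matters. The function names `gpu_ids_from_ctest` and `set_visible_devices` are hypothetical and not part of TriBITS, CTest, or Kokkos.

```shell
#!/bin/sh
# Hypothetical wrapper sketch (not existing TriBITS code): translate the
# GPU ids that CTest assigned to this test into CUDA_VISIBLE_DEVICES.

# Convert a CTest resource value such as "id:2,slots:1;id:3,slots:1"
# into a comma-separated id list such as "2,3".
gpu_ids_from_ctest() {
  printf '%s\n' "$1" | tr ';' '\n' \
    | sed -n 's/^id:\([^,]*\),slots:.*$/\1/p' \
    | paste -sd, -
}

# If CTest assigned GPUs to resource group 0 (assumed resource type name:
# "gpus"), export them; otherwise leave the environment untouched.
set_visible_devices() {
  if [ -n "${CTEST_RESOURCE_GROUP_0_GPUS:-}" ]; then
    CUDA_VISIBLE_DEVICES=$(gpu_ids_from_ctest "$CTEST_RESOURCE_GROUP_0_GPUS")
    export CUDA_VISIBLE_DEVICES
  fi
}

# Intended use inside a wrapper script placed in front of the test command:
#   set_visible_devices
#   exec "$@"
```

With CUDA_VISIBLE_DEVICES set this way, the selected GPU is enumerated as device 0 inside the test process, so raw CUDA calls made before Kokkos initialize and Kokkos itself would land on the same device without any switch.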

@crtrott
Member

crtrott commented Jan 22, 2021

By "one should NOT use that mechanism" I mean specifically for those two tests. As I said, I would recommend either disabling these two tests or marking them as not runnable in parallel with other tests (is that a thing you can do?).

@bartlettroscoe
Member

bartlettroscoe commented Jan 22, 2021

> As I said I would recommend either disabling these two tests, or mark them as not runnable in parallel with other tests (is that a thing you can do?).

Yes and yes. For the former:

and for the latter:

As shown here, this test finished in less than 3s so I think we just need to add:

ATDM_SET_ENABLE(<fullTestName>_SET_RUN_SERIAL ON)

for each of these tests to:

right about here:
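
For a local check before touching the ATDM settings, the same variable can probably be passed on the configure line instead. The sketch below assumes TriBITS honors `<fullTestName>_SET_RUN_SERIAL` as a cache variable given with `-D`; the helper name `run_serial_flags` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print one -D<fullTestName>_SET_RUN_SERIAL=ON
# configure argument per test name given on the command line.
run_serial_flags() {
  for test_name in "$@"; do
    printf '%s%s_SET_RUN_SERIAL=ON\n' '-D' "$test_name"
  done
}

# Intended use (other configure arguments as in "Steps to Reproduce"):
#   cmake $(run_serial_flags KokkosCore_UnitTest_CudaInterOpStreams_MPI_1) \
#     <other-configure-args> $TRILINOS_DIR
```

Only the test named in this issue is shown; the test from #8543 would be added as a second argument.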

@bartlettroscoe
Member

Need feedback from CDash before closing

@grover-trilinos

Test results for issue #8544 as of 2021-01-24

Tests with issue trackers Passed: twip=4

Detailed test results: (click to expand)

Tests with issue trackers Passed: twip=4

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 2 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 7 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 4 | 8 | 16 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 3 | 10 | 13 | #8544 |


@grover-trilinos

Test results for issue #8544 as of 2021-01-31

Tests with issue trackers Passed: twip=10

Detailed test results: (click to expand)

Tests with issue trackers Passed: twip=10

| Site | Build Name | Test Name | Status | Details | Consecutive Pass Days | Non-pass Last 30 Days | Pass Last 30 Days | Issue Tracker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 12 | 14 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_complex_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 9 | 17 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 11 | 13 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_dbg_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 12 | 6 | 18 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 6 | 17 | #8544 |
| vortex | Trilinos-atdm-ats2-cuda-10.1.243-gnu-7.3.1-spmpi-rolling_static_opt_cuda-aware-mpi | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 7 | 10 | 13 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 9 | 8 | 20 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-rdc-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 10 | 6 | 21 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release-debug | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 11 | 7 | 21 | #8544 |
| ride | Trilinos-atdm-white-ride-cuda-9.2-gnu-7.2.0-release | KokkosCore_UnitTest_CudaInterOpStreams_MPI_1 | Passed | Completed | 10 | 8 | 19 | #8544 |


@e10harvey
Contributor Author

Closing as this has been passing since 01-23-2021 as shown in this query.
