Stokhos: Test Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1 randomly failing in 'ats2' CUDA PR build on 'vortex' #11117

Closed
bartlettroscoe opened this issue Oct 6, 2022 · 7 comments
Labels
impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Data Services Issues that fall under the Trilinos Data Services Product Area pkg: Stokhos type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Oct 6, 2022

CC: @trilinos/stokhos

Next Action Status

Description

As shown in this query (click "Shown Matching Output" in upper right), the test:

  • Stokhos_KokkosViewUQPCEUnitTest_Serial_MPI_1

in the unique GenConfig build:

  • ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables

started failing on testing day 2022-05-01.

The specific set of CDash builds impacted were:

  • PR-10472-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-911
  • PR-10472-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-915
  • PR-10571-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1113
  • PR-11086-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1182
  • PR-11099-test-ats2_cuda-10.1.243-gnu-8.3.1-spmpi-rolling_release_static_Volta70_Power9_no-asan_no-complex_no-fpic_mpi_pt_no-rdc_uvm_deprecated-on_no-package-enables-1211

When the test fails, it produces error output like this:

3. Kokkos_View_PCE_DS_LayoutLeft_DeepCopy_NonContiguous_UnitTest ... 
 val = 2.21341409336878452e-321 == val_expected = 1.01000000000000000e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
 val = 3.56221330651538758e-321 == val_expected = 1.01099999999999994e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
 val = 6.95327277181438017e-310 == val_expected = 1.01200000000000003e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
...
 val = 4.79243676466009148e-322 == val_expected = 1.02718181818181819e+02 : FAILED ==> /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:149
 val = 1.01909090909090907e+02 == val_expected = 1.01909090909090907e+02 : passed
 val = 1.02009090909090901e+02 == val_expected = 1.02009090909090901e+02 : passed
 val = 1.02109090909090909e+02 == val_expected = 1.02109090909090909e+02 : passed
 val = 1.02209090909090904e+02 == val_expected = 1.02209090909090904e+02 : passed
 val = 1.02309090909090912e+02 == val_expected = 1.02309090909090912e+02 : passed
 val = 1.02409090909090907e+02 == val_expected = 1.02409090909090907e+02 : passed
 val = 1.02509090909090901e+02 == val_expected = 1.02509090909090901e+02 : passed
 val = 1.02609090909090909e+02 == val_expected = 1.02609090909090909e+02 : passed
 val = 1.02709090909090904e+02 == val_expected = 1.02709090909090904e+02 : passed
 val = 1.02809090909090912e+02 == val_expected = 1.02809090909090912e+02 : passed
 [FAILED]  (0.00153 sec) Kokkos_View_PCE_DS_LayoutLeft_DeepCopy_NonContiguous_UnitTest
 Location: /vscratch1/trilinos/jaas/workspace/Trilinos_PR_cuda-10.1.243/Trilinos/packages/stokhos/test/UnitTest/Stokhos_KokkosViewUQPCEUnitTest.hpp:266
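One way to read the failing values above: they are all subnormal doubles (far below the smallest normal double, about 2.2e-308), which is what small integers or other non-floating-point data look like when their bytes are read back as a double. The sketch below is only a triage aid, not part of the Stokhos test. It decodes the four failing values printed above into their raw 64-bit patterns. The helper names `double_to_bits` and `is_subnormal` are made up for this illustration. Three of the values decode to the small integers 448, 721 and 97, and the third value decodes to a much larger pattern just below 2^47. That would be consistent with the destination view holding uninitialized or non-double data rather than a miscomputed result, though the output alone does not prove it.

```python
import struct


def double_to_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def is_subnormal(x: float) -> bool:
    """True if x is a nonzero double whose exponent field is all zeros."""
    bits = double_to_bits(x)
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0


# Failing values copied from the test output above.
failing_values = [
    2.21341409336878452e-321,
    3.56221330651538758e-321,
    6.95327277181438017e-310,
    4.79243676466009148e-322,
]

for v in failing_values:
    bits = double_to_bits(v)
    print(f"{v!r}: bits = {bits:#018x} ({bits}), subnormal = {is_subnormal(v)}")
```

For comparison, the expected values (101.0 and up) are ordinary normal doubles with a nonzero exponent field.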

Current Status on CDash

Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.

Steps to Reproduce

Follow instructions at:

or see:

for specific instructions on how to build and run on 'vortex'.
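Because the failure is random and showed up in only a handful of PR builds, a single run of the test may well pass. A minimal sketch of a repeat-run loop follows, assuming the test executable has already been built per the instructions above. The function name `count_failures` and the executable path in the usage comment are hypothetical, not taken from the Trilinos build.

```python
import subprocess


def count_failures(cmd, runs):
    """Run cmd `runs` times and return how many runs exited with a nonzero code."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Hypothetical usage (the path is an assumption; substitute the real
# test executable from your build directory):
#   n = count_failures(["./Stokhos_KokkosViewUQPCEUnitTest_Serial.exe"], 100)
#   print(f"{n} failures in 100 runs")
```

A count of zero over many runs does not rule out the bug. It only says the failure did not trigger in that environment.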

@bartlettroscoe bartlettroscoe added type: bug The primary issue is a bug in Trilinos code or tests pkg: Stokhos impacting: tests The defect (bug) is primarily a test failure (vs. a build failure) PA: Data Services Issues that fall under the Trilinos Data Services Product Area labels Oct 6, 2022
@bartlettroscoe
Member Author

FYI: This failure took out my last PR build iteration #11099 (comment) (see #11099 (comment)).

@etphipp
Contributor

etphipp commented Oct 25, 2022

So far I have not been able to reproduce this, either on the ATS2 platform or on a regular Linux platform (note the failing test is running with the Serial execution space, so whatever is going on isn't related to CUDA). I've also tried running the test under valgrind and with the clang address sanitizer. Both came up empty.

@bartlettroscoe
Member Author

@etphipp, it was reported at the TUG today that Sacado might have some undefined memory issues. Does this use DFAD or the reverse AD types?

@etphipp
Contributor

etphipp commented Oct 26, 2022

> @etphipp, it was reported at the TUG today that Sacado might have some undefined memory issues. Does this use DFAD or the reverse AD types?

Yes. It is issue #7741. I never saw it because the team mention was invalid (which is probably a frighteningly common mistake due to the extra characters in the suggested team mention in the Trilinos issue template). I'm working on it now and believe I might have it fixed. It is due to the horribly designed memory management in RAD.

@GrahamBenHarper
Contributor

> I never saw it because the team mention was invalid (which is probably a frighteningly common mistake due to the extra characters in the suggested team mention in the Trilinos issue template).

I may be mistaken, but I believe that users who are not in the Trilinos GitHub group cannot tag individual Trilinos teams. This is why, with a lot of recent issues, you will see @cgcgcg working hard to tag the correct Trilinos teams as soon as they're opened.

@bartlettroscoe
Member Author

> I may be mistaken, but I believe that users who are not in the Trilinos Github group cannot tag individual Trilinos teams.

That is correct. That is a long-known flaw in the Trilinos Issue tracking processes.

@etphipp
Contributor

etphipp commented Nov 28, 2022

Looking at the above query, the last failure was 10/5 and I was never able to reproduce it. So I am going to close this for now. If it fails again, please reopen it.

@etphipp etphipp closed this as completed Nov 28, 2022
3 participants