Belos/FEI/Ifpack2/NOX/Amesos2: Randomly hanging tests on ascisc1** breaking PR builds #11530
Automatic mention of the @trilinos/ifpack2 team
Looking at them (I've been hit by two of them as well), I think three of them exhibit the same behavior: they say they pass, but then time out. The Ifpack2 one actually shows a failure as part of the test, so it may have a different root cause. Should we just disable the other three for now? I'm loath to do that, but I suppose it depends on how long it will take to debug/fix them.
When we were switching to the new SEMS V2 modules, we did run nightly tests with the new configurations.
@csiefer2 alternatively, if this is too big a disruption right now, we could explore backing out the GCC toolchain change and going back to the SEMS V1 modules temporarily, and maybe add a nightly build with those specific failing configurations on the specific machines. Let me know your preference/thoughts. We (Framework and SEMS) really want to get off of the old module stack regardless, so I don't want to lose momentum here, but I also don't want to adversely affect the autotester's ability to get stuff into develop.
Disabling the tests and filing issues might be the way to go. Lots of the "says passed but failed" tests are due to the output order of the PASSED message.
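To illustrate the "says passed but failed" mode: TriBITS-style tests are typically marked as passing by matching a pass string (for example "End Result: TEST PASSED") in the combined output of all ranks, so badly interleaved MPI output can hide the match even when every rank exits cleanly. A hypothetical sketch of that check; the test executable name and the exact pass string are placeholders:

```sh
# Hypothetical illustration (test name is a placeholder): CTest marks the test
# as passed by matching a string in the merged output of all ranks, so badly
# interleaved rank output can break the match even when every process exits 0.
mpirun -np 4 ./SomePackage_SomeTest.exe 2>&1 | grep -q "End Result: TEST PASSED" \
  && echo "would be marked PASSED" \
  || echo "would be marked FAILED despite clean exits"
```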
I could not reproduce this on my local workstation using the instructions for reproducing with genconfig. I will need some help reproducing.
#11466 is also getting tripped up by random failures.
@rppawlo what CPU architecture is your local workstation? The model can be grabbed with the `lscpu` command.
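For reference, a quick way to pull just that information:

```sh
# Print the architecture and CPU model name as reported by the kernel
lscpu | grep -E 'Architecture|Model name'
```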
I am able to reproduce the FEI failure on my workstation (Cascade Lake) by running the test repeatedly (one way to do that is sketched below). It doesn't fail every time, but at least one run in ten does seem to fail. I suspect this will be the case for the other tests that hang and time out as well. I'll try the Ifpack2 test next.
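Not the exact commands used in the comment above, but a minimal sketch of looping a single test until it fails, assuming an already-configured build directory and CTest 3.17 or newer for `--repeat`:

```sh
cd /path/to/trilinos-build   # placeholder for an already-configured build tree
# Re-run just the FEI test up to 10 times, stopping at the first failure
ctest -R '^FEI_fei_ubase_MPI_3$' --repeat until-fail:10 --timeout 600 --output-on-failure
```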
User Support Ticket(s) or Story Referenced: trilinos#11530
If it's exhibiting the same behavior, I would say probably yes. I've added it to the list of disables in #11567; let's see if I can get the syntax right and what PR testing has to say with those five tests disabled. I'll defer to pretty much anybody else about how to handle the failures. I think @csiefer2 was in favor of disabling them and filing issues (this issue, perhaps). My big concern is whether this is related to some kind of issue with the new SEMS toolchain (the OpenMPI 1.10.7 one), and whether we're playing whack-a-mole with unstable tests only to discover a deeper issue that is common across all of them.
Disable should be 'on' to disable.... User Support Ticket(s) or Story Referenced: trilinos#11530
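For context, a hedged sketch of how individual TriBITS-managed tests are commonly disabled at configure time; the exact flags used in #11567 may differ:

```sh
# Setting <fullTestName>_DISABLE=ON at configure time tells TriBITS to skip
# adding that ctest test (one -D option per test); all other configure
# options for the build are omitted from this sketch.
cmake \
  -D FEI_fei_ubase_MPI_3_DISABLE=ON \
  -D Belos_Tpetra_tfqmr_hb_2_MPI_4_DISABLE=ON \
  /path/to/Trilinos
```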
For the NOX tests, this looks like an MPI issue. I was able to replicate the failures after wiping the genconfig directory and then rerunning get_dependencies. Even though I specified OpenMPI 1.10.7 in the string to genconfig, it must not have had an updated submodule and somehow pulled in the 1.10.1 libraries. Thanks to @sebrowne for the ldd info. I believe the issue is in the MPI TPL. The code runs fine and the application executables for each MPI rank exit cleanly. Then MPI hangs inside the mpirun executable. Here's the gdb stack trace:
(gdb) bt
This hang causes a timeout in ctest that triggers the failure. Maybe something in the MPI install is machine specific? We might be able to fix it by moving to a newer version of MPI. In my local builds, using OpenMPI 1.10.1 ran fine without failure.
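For anyone chasing a similar hang, one way to grab that kind of backtrace from a stuck mpirun (assuming gdb is available and ptrace permissions allow attaching) is:

```sh
# Find the hung mpirun (newest matching process owned by the current user)
pid=$(pgrep -n -u "$USER" mpirun)
# Attach non-interactively and dump the stack of every thread
gdb -p "$pid" -batch -ex 'thread apply all bt'
```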
This is probably from the system being updated but the MPI not being rebuilt on top of it. (Looking at the timestamps on the libraries from ldd and comparing them to the MPI libs' timestamps can give a hint as to whether a core system lib changed after MPI was built.)
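A rough sketch of that comparison; the OpenMPI install path below is a placeholder for the actual SEMS location:

```sh
# List the shared libraries that mpirun actually resolves to ...
ldd "$(which mpirun)"
# ... then compare modification times: a core system library that is newer
# than libmpi.so suggests the OS was patched after this MPI was built.
stat -c '%y  %n' /path/to/sems/openmpi-1.10.7/lib/libmpi.so /lib64/libc.so.6
```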
User Support Ticket(s) or Story Referenced: trilinos#11530
This is now blocking development enough that I'm going to roll back the upgrade this morning and go back to the older SEMS modules for the GCC toolchains (once I discuss it with the team and make sure it will still work). Now that we have a reliable-ish reproducer, I've added a story to ensure that reproducer is fixed prior to re-deploying the SEMS module change.
It looks like all of these failures show that the test passes on the root process (i.e. proc 0) but MPI says that other ranks failed. In fact, it looks like it is always one process that fails, showing:
If the tests are using the Teuchos unit test harness correctly, that should be impossible. So this must be a problem with the system. I will post the full extent of these errors and what tests and builds they involve in the next comment.
Okay, I've reverted the configuration now. This should stop impacting people immediately (at least that's the hope), and we have work/testing to figure it out prior to merging it again in the future.
Broadening the query looking for tests that failed, have 'timeout', and show "Primary job terminated normally, but .* process returned" since 1/28/2023, we see the full scope of these failures ... As shown in this query (click "Shown Matching Output" in upper right) the tests:
in the unique GenConfig builds:
started randomly hanging on testing day 2023-01-30. The specific set of CDash builds impacted were:
As you can see, this seems to only be impacting the gnu-8.3.0-openmpi-1.10.7 PR builds. It is interesting that these failures are isolated to just these 5 tests over all that time. What is unique about these tests that differentiates them from the other Trilinos tests?
NOTE: I could not go back further than 1/28/2023 or CDash just hung. (That may be a memory constraint with the CDash PHP setup on trilinos-cdash.sandia.gov that needs to be resolved.)
@sebrowne, FYI: @ndellingwood just reported the same error in the test Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1 in a PR iteration yesterday, shown in #11579 (comment).
Even though it's one of the same tests, the failure mode is different. That one failed when the test was starting, with a CUDA device-side assert. The issue here was specifically tests that hung at the end of their run (MPI_Finalize(), perhaps). With respect to that failure, I think it may be alleviated by #11391, which stops running a bunch of simultaneous tests on one GPU of the machine, potentially leading to resource issues. Or it may be a real bug in Intrepid2 which triggered the assert; I couldn't say for sure. Also note that the CUDA line has been using OpenMPI 4.0.5 for a while (it was not part of this change), so I really don't think it's the same failure case. We've seen spurious CUDA failures off and on since I started on the team, which is part of the motivation for #11391 when I was looking into them.
Okay, that is right. Thanks for clarifying. But that means we still have a randomly failing test Intrepid2_unit-test_MonolithicExecutable_Intrepid2_Tests_MPI_1 taking down PR builds as shown in #11579 (comment).
Addresses issue trilinos#11530
@sebrowne - I believe all the tests mentioned in this ticket have now been fixed or disabled. You could try upgrading the compiler stack again.
This issue has had no activity for 365 days and is marked for closure. It will be closed after an additional 30 days of inactivity.
This issue was closed due to inactivity for 395 days.
Next Action Status
Description
As shown in this query (click "Shown Matching Output" in upper right) the tests:
Belos_Tpetra_tfqmr_hb_2_MPI_4
FEI_fei_ubase_MPI_3
Ifpack2_SGSMT_compare_with_Jacobi_MPI_4
NOX_1DfemStratimikos_MPI_4
in the unique GenConfig builds:
rhel7_sems-gnu-8.3.0-openmpi-1.10.7-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
rhel7_sems-gnu-8.3.0-openmpi-1.10.7-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables
started failing on testing day 2023-01-30.
The specific set of CDash builds impacted were:
PR-11484-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.7-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1717
PR-11516-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.7-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1702
PR-11516-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.7-openmp_release-debug_static_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-1708
PR-11516-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.7-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-228
PR-11516-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.7-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-230
PR-11523-test-rhel7_sems-gnu-8.3.0-openmpi-1.10.7-serial_debug_shared_no-kokkos-arch_no-asan_no-complex_no-fpic_mpi_no-pt_no-rdc_no-uvm_deprecated-on_no-package-enables-241
Current Status on CDash
Run the above query, adjusting the "Begin" and "End" dates to match today or any other date range, or just click "CURRENT" in the top bar to see results for the current testing day.
Steps to Reproduce
See:
If you can't figure out what commands to run to reproduce the problem given this documentation, then please post a comment here and we will give you the exact minimal commands.