
EPIC: Issues with Trilinos PR testing 2022-08 #10858

Closed
bartlettroscoe opened this issue Aug 10, 2022 · 12 comments
Labels
PA: Framework (Issues that fall under the Trilinos Framework Product Area)
type: bug (The primary issue is a bug in Trilinos code or tests)

Comments

@bartlettroscoe
Member

bartlettroscoe commented Aug 10, 2022

There are a number of issues currently impacting Trilinos PR testing that are blocking many active PRs from passing the PR tester:

Also of interest:

@bartlettroscoe added the type: bug label Aug 10, 2022
@bartlettroscoe pinned this issue Aug 10, 2022
@bartlettroscoe added the PA: Framework label Aug 10, 2022
@bartlettroscoe changed the title from "Issues with Trilinos PR testing 8/2022" to "Issues with Trilinos PR testing 2022-08" Aug 14, 2022
@csiefer2
Member

#10896

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 17, 2022

#10896

As explained in #10896 (comment), those are not causes of PR failures; they are victims of a build failure that was also reported in those same PR builds (or not reported, in the case of the clang-10.0.0 builds using the older CMake 3.17.1, as per #10893 (comment) :-( ).

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 19, 2022

Even with all of the fixes in the linked issues above, we still have a large logjam in PR testing. For example, right now this issues query shows there are 8 PRs with the AT: RETEST label on them. It looks like something is wrong with the PR tester in that only one set of PR builds is running at a time. The outer driver for the second set of PRs appears to be broken (see https://do.sandia.gov/trilinos-ci/job/Trilinos_autotester_driver_inst_1/). I will post a Trilinos HelpDesk issue.

Update: Here is the helpdesk issue: TRILINOSHD-186

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 19, 2022

FYI: I posted TRILINOSHD-187 to see if they can turn off auto-retest of failing PR builds. That is often a waste of computing resources, as the developer has not yet had a chance to look over the failed PR build results to see whether rerunning the PR builds is likely to fix anything.

@bartlettroscoe changed the title from "Issues with Trilinos PR testing 2022-08" to "EPIC: Issues with Trilinos PR testing 2022-08" Aug 20, 2022
@bartlettroscoe
Member Author

bartlettroscoe commented Aug 21, 2022

As noted above, only one of the two Jenkins Trilinos_autotester_driver_inst instances is running, and there seems to be logic built in that assumes the other instance is running PR builds. For example:

shows:

22:00:38 PROCESSING prnum = 10897
22:00:38    PROCESSING PULL REQ# 10897
22:00:38       - PRINFO - Mergeable: True; State: open; Title: Improve TrilinosInstallTests to make more clear (#10896); By: bartlettroscoe
22:00:38       - SOURCE - Repo: bartlettroscoe/Trilinos; branch: 10896-improve-doc-install-tests; sha: 00ca3870b23ddea877be9f1abbaf1e3bfee833ad; User: bartlettroscoe
22:00:38       - TARGET - Repo: trilinos/Trilinos; branch: develop; sha: f4b73d31ef4f77626c1c7c8c9a86034cc64fa011; User: trilinos
22:00:38       - PR From Fork = True
22:00:38 *** NOTICE: This PR #10897 is not tested by this instance of the pullrequestautotester_inst_0; Instance 1 will test it - SKIPPING THIS PULL REQUEST

So the PR tester logic seems to think that the other instance (Instance 1) will be testing my PR #10897, but that instance appears to be broken, as shown at, for example:

showing:

21:42:00 Started by timer
21:42:00 Running as SYSTEM
21:42:01 Building remotely on [pr-host-1](https://do.sandia.gov/trilinos-ci/computer/pr-host-1) (ceecloud_PR_host_1 PR_host) in workspace <...>/Trilinos_autotester_driver_inst_1
21:42:01 [WS-CLEANUP] Deleting project workspace...
21:42:01 [WS-CLEANUP] Deferred wipeout is used...
21:42:07 ERROR: [WS-CLEANUP] Cannot delete workspace: Remote call on pr-host-1 failed
21:42:07 ERROR: Cannot delete workspace: Remote call on pr-host-1 failed
21:42:07 Finished: FAILURE

As a result, I think all 8 of the PRs with AT: RETEST put on them are assigned to run in PR "Instance 1", which is broken.

Consequently, no PR builds have started since Aug 19, 2022 - 18:08 MDT, as shown here.

This system does not seem to be very robust when one of these driver instances goes down.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 21, 2022

Looking over the PR builds being skipped with the message:

*** NOTICE: This PR #<prnum> is not tested by this instance of the pullrequestautotester_inst_0; Instance 1 will test it - SKIPPING THIS PULL REQUEST

at:

I am noticing that all of the <prnum> like #8161, #8737, #8855, ..., #10909 (41 PRs in total) are odd numbers. That suggests that the PR tester's approach to load balancing is to use <prnum> % num_pr_drivers to compute which PR driver instance (0, 1, ...) will test a given PR. So with two running instances, odd-numbered PRs map to Instance 1, the broken one (see the sketch below).

So if you want to get your PR to be tested, just open a new PR and make sure the PR number is even :-)
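
To make that hypothesis concrete, here is a minimal Python sketch of the suspected <prnum> % num_pr_drivers assignment rule. This is only an illustration of the guess above, not the actual autotester code; the function name is made up:

# Hypothetical sketch of the suspected load-balancing rule: the PR number
# modulo the number of driver instances picks the instance that tests it.
def driver_instance_for_pr(prnum, num_pr_drivers=2):
    return prnum % num_pr_drivers

# With two instances, even-numbered PRs map to instance 0 and
# odd-numbered PRs map to instance 1 (the broken one in this case).
for prnum in [8161, 8737, 8855, 10909, 10912]:
    print(f"PR #{prnum} -> Trilinos_autotester_driver_inst_{driver_instance_for_pr(prnum)}")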

@bartlettroscoe
Member Author

Looking at the history for the broken PR tester driver:

it seems that the Trilinos_autotester_driver_inst_1 last ran successfully on 'Aug 16, 2022, 2:03:02 PM'.

After that, starting on 'Aug 17, 2022, 6:37:29 PM' for:

that instance was broken and not running any PR builds.

So it seems that for the last 4 days the PR tester has been running only one set of PR builds at a time and, at that, testing only even-numbered PRs :-)

And to back this up, this CDash query shows that the last time an odd-numbered PR was tested was Aug 17, 2022 - 14:18 MDT with the build:

After that, only even-numbered PRs were tested.

And if you look at the current list of PRs with AT: RETEST on them here, you will see they are all odd-numbered PRs.

(NOTE: Those dates don't exactly match up, so I am not sure how an odd-numbered PR was able to be tested after 'Aug 16, 2022, 2:03:02 PM', but the general trend holds: only even-numbered PRs are getting tested, and only odd-numbered PRs still have AT: RETEST.)

@bartlettroscoe
Member Author

FYI: I created the even-numbered PR #10912 to test the even-only PR testing hypothesis above. But it looks like PR builds will not start up again until tomorrow morning after 7 AM MDT, so we will have to wait until then.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 21, 2022

As predicted above, the next job fired off at 7:16 AM MDT with:

and it is running PR builds for the even-numbered PR #10912, showing:

09:22:10    * ========================
09:22:10    * PERFORMING TESTING ON PULL REQ# 10912 (Run Number 1/2)
09:22:10    * ========================

(Jenkins must be showing times according to my browser's time zone, so this must be 09:22:10 EDT, not MDT.)
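
For reference, a quick check of that two-hour offset (assuming this run was on Aug 22, 2022; the exact date is not shown in the log excerpt above):

from datetime import datetime
from zoneinfo import ZoneInfo

# 09:22:10 in the browser's zone (EDT) is 07:22:10 MDT, consistent with
# the job having fired off shortly after 7 AM MDT.
t_edt = datetime(2022, 8, 22, 9, 22, 10, tzinfo=ZoneInfo("America/New_York"))
print(t_edt.astimezone(ZoneInfo("America/Denver")))  # 2022-08-22 07:22:10-06:00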

@bartlettroscoe
Member Author

@bartlettroscoe
Member Author

FYI: I think the major errors blocking PR builds from passing have been resolved. Now there are just a few random failures like #6861 that are still taking down PR testing iterations.

@bartlettroscoe
Member Author

FYI: As (not) shown in this GitHub Issue Query, there are no more PRs with the AT: RETEST label. No PR builds have started in the last 22 hours, and the two drivers are running, with their latest runs shown at:

and

And this GitHub Issue Query shows that there are just 8 open PRs with approved reviews, and of those, 7 have a failing PR builds status.

Therefore, the logjam of PR builds is over. Yes, there are still a few sources of random failures that may trigger a failing PR testing iteration (e.g. #6861 and #10989), but everything else seems to have been addressed.

I will now close this EPIC and remove the pin.

@bartlettroscoe unpinned this issue Aug 27, 2022