Framework: Jenkins PR driver down and skipping of odd-numbered PRs starting 2022-08-16 #10917

Closed
bartlettroscoe opened this issue Aug 22, 2022 · 14 comments
Labels
Framework tasks Framework tasks (used internally by Framework team) PA: Framework Issues that fall under the Trilinos Framework Product Area type: bug The primary issue is a bug in Trilinos code or tests

Comments

@bartlettroscoe
Member

bartlettroscoe commented Aug 22, 2022

Bug Report

@trilinos/framework

Internal Issues

Description

The Jenkins PR tester driver became inoperable starting 2022-08-16, as explained in #10858 (comment). As a result, only even-numbered PRs have been getting tested, leaving a large backlog of odd-numbered PRs that have waited many days to be tested (currently 8, as shown in this PR query).

Steps to Reproduce

Look at:

@bartlettroscoe bartlettroscoe added the type: bug The primary issue is a bug in Trilinos code or tests label Aug 22, 2022
@bartlettroscoe bartlettroscoe added Framework tasks Framework tasks (used internally by Framework team) PA: Framework Issues that fall under the Trilinos Framework Product Area labels Aug 22, 2022
@tasmith4
Contributor

@bartlettroscoe can you "share" that internal issue with me? I'll also note #10851 appears to be bit by this issue as well although it does not show up in your query -- I don't have RETEST on it since it passed already, but apparently even though it is approved and passing tests, it needs the autotester to circle back before it can merge ...

@bartlettroscoe
Member Author

@tasmith4

@bartlettroscoe can you "share" that internal issue with me?

Done. See TRILINOSHD-186

I'll also note #10851 appears to be bit by this issue as well although it does not show up in your query -- I don't have RETEST on it since it passed already, but apparently even though it is approved and passing tests, it needs the autotester to circle back before it can merge ...

Correct.

@bartlettroscoe
Member Author

Looks like "Instance 1" is running again just now:

@e10harvey closed TRILINOSHD-186.

@bartlettroscoe
Member Author

bartlettroscoe commented Aug 23, 2022

FYI: Another thing I learned as part of this is how the PR testing system selects which PRs to test. Of all the PRs that are requested and eligible to be tested, it selects the one with the lowest integer number. For example, of the 8 PRs requested to be tested shown here, it selected the oldest, PR #10751 (i.e. the one with the lowest integer number). Therefore, for any new odd-numbered PR, you will have to wait for 16 PR build iterations (8 PRs at 2 builds per PR, assuming they fail) before your odd-numbered PR gets tested.

The solution to this problem is to create a dummy even-numbered PR with a new branch name but the same commits, and that will get tested right away (because there are currently no even-numbered PRs in the queue waiting to be tested).
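A minimal sketch of the selection behavior described above, assuming each driver instance handles one parity and always picks the lowest-numbered eligible PR. The autotester source is not visible from this thread, so the function names, the parity encoding, and the `builds_per_pr` default are all hypothetical; the PR numbers are ones mentioned in this issue and serve only as illustrative input.

```python
from typing import Iterable, Optional


def select_next_pr(eligible: Iterable[int], instance_parity: int) -> Optional[int]:
    """Return the lowest-numbered eligible PR whose parity matches this lane.

    instance_parity is 0 for the even-numbered lane and 1 for the odd-numbered
    lane (an assumed encoding, not taken from the real driver).
    """
    candidates = [n for n in eligible if n % 2 == instance_parity]
    return min(candidates) if candidates else None


def builds_before(pr: int, eligible: Iterable[int], builds_per_pr: int = 2) -> int:
    """Count the builds the same lane would run before it reaches `pr`.

    Assumes every lower-numbered PR of the same parity is built
    `builds_per_pr` times (e.g. two builds per PR if they fail).
    """
    ahead = [n for n in eligible if n % 2 == pr % 2 and n < pr]
    return len(ahead) * builds_per_pr


# PR numbers mentioned in this thread, used only as illustrative input.
waiting = [10751, 10827, 10829, 10905]

print(select_next_pr(waiting, 1))            # odd lane picks the lowest odd PR
print(select_next_pr(waiting, 0))            # even lane has nothing to do
print(select_next_pr(waiting + [10920], 0))  # a new even PR is picked at once
```

Under these assumptions the odd lane always starts with #10751, while the even lane is idle until an even-numbered PR such as #10920 appears. With eight odd-numbered PRs ahead and two builds each, `builds_before` gives the 16 iterations quoted above.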

@jhux2
Member

jhux2 commented Aug 23, 2022

Were the number of test lanes always so low, or were they reduced over time due to resource contention?

@bartlettroscoe
Member Author

Were the number of test lanes always so low, or were they reduced over time due to resource contention?

@jhux2, from looking at:

it appears they had as many as 4 of these Trilinos_autotester_driver_inst_<idx> jobs running at one time.

@tasmith4
Contributor

The solution to this problem is to create a dummy even-numbered PR with a new branch name but the same commits, and that will get tested right away (because there are currently no even-numbered PRs in the queue waiting to be tested).

@bartlettroscoe I tried this with #10851, but the new even-numbered PR #10920 also seems to be ignored by the autotester. Where can I check the queues to see what's running and why my new one is also blocked? I visited the Jenkins page but wasn't sure exactly where to look for this info.

@bartlettroscoe
Member Author

@bartlettroscoe I tried this with #10851, but the new even-numbered PR #10920 also seems to be ignored by the autotester. Where can I check the queues to see what's running and why my new one is also blocked? I visited the Jenkins page but wasn't sure exactly where to look for this info.

@tasmith4, you can see the most recent runs of the two PR drivers shown at:

That PR ID 10920 is not even listed in that output.

I think the reason it did not test PR #10920 is that the target branch for PR #10920 is 'master' instead of 'develop'. Please change the target branch of PR #10920 to 'develop' and it should get tested.
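For anyone wanting to check this before waiting on the autotester, here is a small sketch of a target-branch check. It assumes the PR is available as a dictionary shaped like the GitHub REST API pull request object (`number` and `base.ref`); the function name `skip_reason` and the idea that the autotester applies exactly this test are assumptions, not something confirmed by the Framework team.

```python
from typing import Any, Mapping, Optional


def skip_reason(pr: Mapping[str, Any], required_base: str = "develop") -> Optional[str]:
    """Return why a PR would be skipped for its target branch, or None if it is fine.

    `pr` is expected to look like a GitHub REST API pull request object,
    i.e. it has pr["number"] and pr["base"]["ref"].
    """
    base = pr["base"]["ref"]
    if base != required_base:
        return f"PR #{pr['number']} targets '{base}', expected '{required_base}'"
    return None


# Hand-written stand-in for the API response, mirroring the situation above.
print(skip_reason({"number": 10920, "base": {"ref": "master"}}))
print(skip_reason({"number": 10920, "base": {"ref": "develop"}}))
```

The first call reports the mismatch and the second returns `None`, which matches the fix suggested here: retarget the PR to 'develop'.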

@tasmith4
Contributor

I think the reason it did not test PR #10920 is that the target branch for PR #10920 is 'master' instead of 'develop'.

Wow. Don't know how I missed that ... thanks. :)

@cgcgcg
Contributor

cgcgcg commented Aug 24, 2022

@bartlettroscoe Do I need to do anything to get an odd-numbered PR scheduled for testing now? #10905 has been sitting idle for 5 days.

@bartlettroscoe
Member Author

@bartlettroscoe Do I need to do anything to get an odd-numbered PR scheduled for testing now? #10905 has been sitting idle for 5 days.

@cgcgcg, odd-numbered PRs are being tested again. See above. From the latest Jenkins output, it is currently testing PR #10827. The problem is that there is a big backlog of odd-numbered PRs waiting to be tested. From the autotester output on Jenkins and the current odd-numbered PRs with AT: RETEST listed here, it looks like there are 4 odd-numbered PRs ahead of #10905. I don't know the exact logic for selecting which PRs to test (because I have never seen the autotester code), but it sure looks to me like it just iterates over the PRs in ascending integer ID order and picks the first PR that meets the testing criteria. (So if the odd-numbered PRs with lower IDs keep failing and have AT: RETEST set on them again before the autotester loops around, then the later PRs will never get tested.) You will have to ask the @trilinos/framework team for more info.
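The starvation concern in the parenthetical can be illustrated with a toy simulation. This is only a model of the guessed behavior (lowest ID first, failing PRs re-flagged with AT: RETEST before the next loop); the function `simulate` and its parameters are hypothetical, and the real autotester may well behave differently.

```python
from typing import Iterable, List, Set


def simulate(queue: Iterable[int], always_failing: Set[int], iterations: int) -> List[int]:
    """Return the PRs tested, in order, under a lowest-ID-first policy.

    A PR in `always_failing` fails every time and is assumed to get
    AT: RETEST set again before the next loop, so it never leaves the queue.
    Any other PR passes once and is removed.
    """
    pending = set(queue)
    tested: List[int] = []
    for _ in range(iterations):
        if not pending:
            break
        pr = min(pending)           # lowest integer ID goes first
        tested.append(pr)
        if pr not in always_failing:
            pending.remove(pr)      # passed, so it leaves the queue
    return tested


# Odd-numbered PRs mentioned in this thread, used only as illustrative input.
odd_queue = [10827, 10829, 10905]

print(simulate(odd_queue, always_failing={10827}, iterations=5))
print(simulate(odd_queue, always_failing=set(), iterations=5))
```

In the first run #10827 is tested on every iteration and #10905 is never reached; in the second run, where nothing keeps failing, all three PRs are tested in ascending order.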

@jhux2
Member

jhux2 commented Aug 24, 2022

Were the number of test lanes always so low, or were they reduced over time due to resource contention?

@jhux2, from looking at:

* https://do.sandia.gov/trilinos-ci/

it appears they had as many as 4 of these Trilinos_autotester_driver_inst_<idx> jobs running at one time.

As things stabilize, perhaps Framework could try increasing the lanes, even by one.

@bartlettroscoe
Member Author

See, creating the even-numbered PR #10920 caused it to be tested right away, and once it was approved and merged, GitHub automatically showed the identical PR branch #10851 as merged.

@bartlettroscoe
Member Author

Odd-numbered PRs are definitely getting tested again, as shown here by the testing of PR #10829, which shows:

image

Closing this as complete.


4 participants