
fix: remove callbacks resulting in scales due to incomplete response #2671

Merged
merged 5 commits into actions:master on Jul 25, 2023

Conversation

Contributor

@Langleu Langleu commented Jun 14, 2023

Fixes #2659

This removes the callbacks in the listWorkflowJobs function, which are only used by two edge cases that shouldn't scale the RunnerDeployment anyway.
GitHub can return an empty job array for a run even though the run does contain jobs. Because the fallback checked no labels and counted the run anyway, this results in unexpected scale-ups of unrelated RunnerDeployments.

I could also rework this into a feature flag to preserve the old behaviour, but I currently don't see a use case for it.
The added logging may help users spot stuck jobs: when GitHub Actions has an internal error, workflows can get stuck without users noticing. It therefore helps to cancel those jobs once and for all, since they would otherwise keep triggering the edge case described above.
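
For readers who haven't dug into the scaler, here is a minimal Go sketch of the behaviour being removed; all names, types and the labelsMatch helper are made-up assumptions for illustration and do not correspond to the actual actions-runner-controller code.

```go
package scaler

// Everything below is an illustrative sketch only; the type names, fields and
// helper do not match the actual actions-runner-controller code.

type WorkflowJob struct {
	Labels []string
}

type WorkflowRun struct {
	Jobs []WorkflowJob
}

// labelsMatch reports whether every label required by the RunnerDeployment is
// present on the job.
func labelsMatch(jobLabels, deploymentLabels []string) bool {
	present := make(map[string]struct{}, len(jobLabels))
	for _, l := range jobLabels {
		present[l] = struct{}{}
	}
	for _, l := range deploymentLabels {
		if _, ok := present[l]; !ok {
			return false
		}
	}
	return true
}

// countNecessaryRunners sketches the change: a run whose job listing comes
// back empty is skipped, because without job labels there is no way to tell
// which RunnerDeployment (if any) it belongs to.
func countNecessaryRunners(runs []WorkflowRun, deploymentLabels []string) int {
	necessary := 0
	for _, run := range runs {
		if len(run.Jobs) == 0 {
			// Removed fallback: previously the run was counted here regardless of
			// labels, which scaled unrelated RunnerDeployments whenever GitHub
			// returned an incomplete (empty) jobs response.
			continue
		}
		for _, job := range run.Jobs {
			if labelsMatch(job.Labels, deploymentLabels) {
				necessary++
			}
		}
	}
	return necessary
}
```

The point is simply that an empty jobs response carries no label information, so counting it cannot be attributed to any particular RunnerDeployment.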

@Langleu Langleu requested review from mumoshu, toast-gear, a team and nikola-jokic as code owners June 14, 2023 12:40
@sergeykranga

After digging through the code base trying to understand why we see random scale-ups of runners that don't actually do anything, we arrived at the same issue described here (thanks for opening the pull request too)!

This is affecting us as well and causing extra cost for EC2 instances. It would be great to get some attention from the maintainers here.

@Link- Link- added the needs triage (Requires review from the maintainers) and community (Community contribution) labels Jun 23, 2023
Contributor Author

Langleu commented Jun 27, 2023

@mumoshu, @toast-gear, @nikola-jokic, anyone up for a review?
Would be nice to get it fixed since it can unknowingly result in tremendous bills.

mumoshu previously approved these changes Jun 27, 2023
Collaborator

@mumoshu mumoshu left a comment

Hi @Langleu! First of all, awesome job! I admit the removed logic was only helpful when there were just a few jobs with different labels, which may have been the case 2 or 3 years ago. Today it might be worse than nothing, as many folks have started using self-hosted runners for relatively more complex use cases.

LGTM. Let me merge this, and I will try to cut a new ARC patch version this week!

Thank you for your contribution!

Contributor Author

Langleu commented Jun 27, 2023

Hey @mumoshu, thanks for having a look at it!

Any good suggestions on how to fix the tests?
From what I can see, the tests are built entirely around the runID == 0 callback, which in reality should never happen.

I played around with it a bit: we could give every single workflow run an id and mock the response via workflowJobs.
This only works if a label like self-hosted is passed, since otherwise the jobs are ignored. Which honestly makes sense, as the GHA controller shouldn't trigger on jobs that don't carry the self-hosted label.
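
As a rough illustration of that idea, a fixture could look roughly like the following; the struct and field names are assumptions loosely based on the fixture shape visible in the diff, not the real test harness.

```go
package autoscaling_test

// Hypothetical fixture shape, for illustration only.
type testCase struct {
	workflowRunsInProgress string         // mocked "list workflow runs" response
	workflowJobs           map[int]string // mocked "list jobs for run" responses, keyed by run id
	want                   int            // expected desired replicas
}

// A run with a real id whose mocked jobs carry the self-hosted label, so the
// autoscaler counts it; drop the label and the job is treated as a
// GitHub-hosted job and ignored.
var selfHostedRun = testCase{
	workflowRunsInProgress: `{"total_count": 1, "workflow_runs":[{"id": 1, "status":"in_progress"}]}`,
	workflowJobs: map[int]string{
		1: `{"total_count": 1, "jobs":[{"status":"in_progress", "labels":["self-hosted"]}]}`,
	},
	want: 1,
}
```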

Contributor Author

Langleu commented Jun 27, 2023

@mumoshu, 45c975b shows how I imagine the tests being fixed at the moment.
Some of them were unknowingly relying on the runID == 0 callback, which is the default in the tests but not what happens in reality.

I additionally added two small test cases to make sure that neither a hosted runner nor an empty Jobs array scales the runner.
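
Reusing the hypothetical testCase struct sketched above (again, the names are assumptions and not the exact fixtures from 45c975b), the two negative cases could look roughly like this:

```go
// A run whose only job targets GitHub-hosted runners: no self-hosted label,
// so it should not scale the RunnerDeployment.
var hostedOnlyRun = testCase{
	workflowRunsInProgress: `{"total_count": 1, "workflow_runs":[{"id": 2, "status":"in_progress"}]}`,
	workflowJobs: map[int]string{
		2: `{"total_count": 1, "jobs":[{"status":"in_progress", "labels":["ubuntu-latest"]}]}`,
	},
	want: 0,
}

// A run for which GitHub returns an empty jobs array (the incomplete response
// from #2659): with the callback removed it no longer scales anything.
var emptyJobsRun = testCase{
	workflowRunsInProgress: `{"total_count": 1, "workflow_runs":[{"id": 3, "status":"in_progress"}]}`,
	workflowJobs: map[int]string{
		3: `{"total_count": 0, "jobs":[]}`,
	},
	want: 0,
}
```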

Looking forward to getting it merged and having a new release! 🚀

@@ -534,8 +584,9 @@ func TestDetermineDesiredReplicas_OrganizationalRunner(t *testing.T) {
 workflowRuns: `{"total_count": 4, "workflow_runs":[{"status":"in_progress"}, {"status":"in_progress"}, {"status":"in_progress"}, {"status":"completed"}]}"`,
 workflowRuns_queued: `{"total_count": 0, "workflow_runs":[]}"`,
 workflowRuns_in_progress: `{"total_count": 3, "workflow_runs":[{"status":"in_progress"},{"status":"in_progress"},{"status":"in_progress"}]}"`,
-want: 3,
+want: 1,
Contributor Author

@Langleu Langleu Jun 27, 2023

This test case supposedly has fixed set to 3.
Maybe the test framework is broken, but I didn't see the fixed value having any effect.
It only worked before because the callback was secretly bumping the in_progress value.

Collaborator

Good catch! However, I think this makes the test case useless. How about adding "id":123 (or whatever) to every workflow_runs item so that it still falls into the desired code path, allowing this want value to remain 3?

Contributor Author

@Langleu Langleu Jun 28, 2023

@mumoshu, I looked through the code a bit, but I don't think it makes any difference.
The fixed value influences the replicas of the RunnerDeployment, which ultimately is the scaleTarget, but none of that information is used when calculating the desired number of replicas.
The workflows would need a self-hosted label to produce a requirement of 3. Without it they're treated as 3 hosted runners, so the desired replicas equal the defined minimum because those runs aren't relevant.

All autoscaler tests run against the AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns metric and none against AutoscalingMetricTypePercentageRunnersBusy. For the latter I could imagine fixed, and therefore the scaleTarget, having an influence, but since we only look at queued and in-progress runs, the fixed value and the RunnerDeployment replicas have zero influence on the desired value.
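
A heavily simplified sketch of the difference (assumed signatures, not the actual ARC implementation): only the percentage-based metric takes the current replica count as an input, so fixed / scaleTarget cannot affect the queued-and-in-progress calculation.

```go
// Desired replicas under TotalNumberOfQueuedAndInProgressWorkflowRuns:
// only the number of matching queued and in-progress runs matters, so the
// current replica count (and therefore the fixed value) has no effect.
func desiredByQueuedAndInProgress(queued, inProgress int) int {
	return queued + inProgress
}

// Desired replicas under PercentageRunnersBusy (very rough sketch): the
// current replica count is an input here, so fixed / scaleTarget would
// actually matter for this metric.
func desiredByPercentageRunnersBusy(current, busy int, scaleUpThreshold, scaleUpFactor float64) int {
	if current == 0 {
		return 0
	}
	if float64(busy)/float64(current) > scaleUpThreshold {
		return int(float64(current) * scaleUpFactor)
	}
	return current
}
```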

Contributor Author

@mumoshu, I'd propose removing the fixed tests, since they're irrelevant for the AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns tests.
Are you okay with that change?

Contributor Author

@mumoshu, just another ping since this is stalling a bit...

Contributor Author

@nikola-jokic, done.

Collaborator

Thank you!

Contributor Author

The lint failure looks like a timeout; it works locally.
Same for the RunnerSet-related tests: this change doesn't touch that logic, and they run into a timeout as well.

Collaborator

No worries, I'll re-run the tests. I know it is not affecting that part of the code.

Collaborator

Thanks for your assistance @nikola-jokic!

Collaborator

@mumoshu mumoshu left a comment

Sorry for the delay! Apparently I only got confused by how the tests are commented.
I've corrected some comments, and this looks good overall now.
Thanks for your patience and contribution, @Langleu!
// I've been a bit out of bandwidth due to a lack of sponsorships.

@mumoshu mumoshu merged commit 2974429 into actions:master Jul 25, 2023
@Langleu Langleu deleted the remove-autoscaler-callback branch July 25, 2023 06:00
Link- pushed a commit that referenced this pull request Jul 26, 2023
Labels: community (Community contribution), needs triage (Requires review from the maintainers)

Successfully merging this pull request may close the following issue:

Pull-based autoscaling causes unrelated scale-ups due to incomplete GitHub response (#2659)

7 participants