
fix: remove callbacks resulting in scales due to incomplete response #2671

Merged
merged 5 commits into actions:master on Jul 25, 2023

Conversation

Contributor

@Langleu Langleu commented Jun 14, 2023

Fixes #2659

This removes the callbacks in the listWorkflowJobs function, which are only used by two edge cases that shouldn't scale the RunnerDeployment anyway.
GitHub can return an empty job array for a run even though the run does contain jobs. Because the fallback checked no labels and counted the run anyway, this results in unexpected scale-ups of unrelated RunnerDeployments.

I could also rework this into a feature flag to preserve the old behaviour, but I currently don't see a use case for it.
The added logging may help users spot stuck jobs: when GitHub Actions has an internal error, workflows can get stuck without users noticing. It therefore helps to cancel those jobs once and for all, since they would otherwise keep triggering the edge case described above.
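
For readers who haven't dug into the scaler, here is a minimal Go sketch of the behaviour being removed; all names, types and the labelsMatch helper are made-up assumptions for illustration and do not correspond to the actual actions-runner-controller code.

```go
package scaler

// Everything below is an illustrative sketch only; the type names, fields and
// helper do not match the actual actions-runner-controller code.

type WorkflowJob struct {
	Labels []string
}

type WorkflowRun struct {
	Jobs []WorkflowJob
}

// labelsMatch reports whether every label required by the RunnerDeployment is
// present on the job.
func labelsMatch(jobLabels, deploymentLabels []string) bool {
	present := make(map[string]struct{}, len(jobLabels))
	for _, l := range jobLabels {
		present[l] = struct{}{}
	}
	for _, l := range deploymentLabels {
		if _, ok := present[l]; !ok {
			return false
		}
	}
	return true
}

// countNecessaryRunners sketches the change: a run whose job listing comes
// back empty is skipped, because without job labels there is no way to tell
// which RunnerDeployment (if any) it belongs to.
func countNecessaryRunners(runs []WorkflowRun, deploymentLabels []string) int {
	necessary := 0
	for _, run := range runs {
		if len(run.Jobs) == 0 {
			// Removed fallback: previously the run was counted here regardless of
			// labels, which scaled unrelated RunnerDeployments whenever GitHub
			// returned an incomplete (empty) jobs response.
			continue
		}
		for _, job := range run.Jobs {
			if labelsMatch(job.Labels, deploymentLabels) {
				necessary++
			}
		}
	}
	return necessary
}
```

The point is simply that an empty jobs response carries no label information, so counting it cannot be attributed to any particular RunnerDeployment.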

@Langleu Langleu requested review from mumoshu, toast-gear, a team and nikola-jokic as code owners June 14, 2023 12:40
@sergeykranga

After digging through the code base trying to understand why we see random scale-ups of runners that don't actually do anything, we arrived at the same issue described here (thanks for opening the pull request too)!

This is affecting us as well and causing extra cost for EC2 instances. It would be great to get some attention from the maintainers here.

@Link- Link- added the needs triage (Requires review from the maintainers) and community (Community contribution) labels Jun 23, 2023
Contributor Author

Langleu commented Jun 27, 2023

@mumoshu, @toast-gear, @nikola-jokic, anyone up for a review?
Would be nice to get it fixed since it can unknowingly result in tremendous bills.

mumoshu previously approved these changes Jun 27, 2023
Collaborator

@mumoshu mumoshu left a comment

Hi @Langleu! First of all, awesome job! I admit the removed logic was only helpful when there were just a few jobs with different labels, which may have been the case 2 or 3 years ago. Today it might be worse than nothing, as many folks have started using self-hosted runners for relatively more complex use cases.

LGTM. Let me merge this, and I will try to cut a new ARC patch version this week!

Thank you for your contribution!

Contributor Author

Langleu commented Jun 27, 2023

Hey @mumoshu, thanks for having a look at it!

Any good suggestions on how to fix the tests?
From what I can see, the tests are built entirely around the runID == 0 callback, which in reality should never happen.

I played around with it a bit: we could give every single workflow run an id and mock the response via workflowJobs.
This only works if a label like self-hosted is passed, since otherwise the jobs are ignored. Which honestly makes sense, as the GHA controller shouldn't trigger on jobs that don't carry the self-hosted label.
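
As a rough illustration of that idea, a fixture could look roughly like the following; the struct and field names are assumptions loosely based on the fixture shape visible in the diff, not the real test harness.

```go
package autoscaling_test

// Hypothetical fixture shape, for illustration only.
type testCase struct {
	workflowRunsInProgress string         // mocked "list workflow runs" response
	workflowJobs           map[int]string // mocked "list jobs for run" responses, keyed by run id
	want                   int            // expected desired replicas
}

// A run with a real id whose mocked jobs carry the self-hosted label, so the
// autoscaler counts it; drop the label and the job is treated as a
// GitHub-hosted job and ignored.
var selfHostedRun = testCase{
	workflowRunsInProgress: `{"total_count": 1, "workflow_runs":[{"id": 1, "status":"in_progress"}]}`,
	workflowJobs: map[int]string{
		1: `{"total_count": 1, "jobs":[{"status":"in_progress", "labels":["self-hosted"]}]}`,
	},
	want: 1,
}
```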

Contributor Author

Langleu commented Jun 27, 2023

@mumoshu, 45c975b shows how I imagine the tests being fixed at the moment.
Some of them were unknowingly relying on the runID == 0 callback, which is the default in the tests but not what happens in reality.

I additionally added two small test cases to make sure that neither a hosted runner nor an empty Jobs array scales the runner.
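
Reusing the hypothetical testCase struct sketched above (again, the names are assumptions and not the exact fixtures from 45c975b), the two negative cases could look roughly like this:

```go
// A run whose only job targets GitHub-hosted runners: no self-hosted label,
// so it should not scale the RunnerDeployment.
var hostedOnlyRun = testCase{
	workflowRunsInProgress: `{"total_count": 1, "workflow_runs":[{"id": 2, "status":"in_progress"}]}`,
	workflowJobs: map[int]string{
		2: `{"total_count": 1, "jobs":[{"status":"in_progress", "labels":["ubuntu-latest"]}]}`,
	},
	want: 0,
}

// A run for which GitHub returns an empty jobs array (the incomplete response
// from #2659): with the callback removed it no longer scales anything.
var emptyJobsRun = testCase{
	workflowRunsInProgress: `{"total_count": 1, "workflow_runs":[{"id": 3, "status":"in_progress"}]}`,
	workflowJobs: map[int]string{
		3: `{"total_count": 0, "jobs":[]}`,
	},
	want: 0,
}
```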

Looking forward to getting it merged and having a new release! 🚀

@@ -534,8 +584,9 @@ func TestDetermineDesiredReplicas_OrganizationalRunner(t *testing.T) {
 workflowRuns: `{"total_count": 4, "workflow_runs":[{"status":"in_progress"}, {"status":"in_progress"}, {"status":"in_progress"}, {"status":"completed"}]}"`,
 workflowRuns_queued: `{"total_count": 0, "workflow_runs":[]}"`,
 workflowRuns_in_progress: `{"total_count": 3, "workflow_runs":[{"status":"in_progress"},{"status":"in_progress"},{"status":"in_progress"}]}"`,
-want: 3,
+want: 1,
Contributor Author

@Langleu Langleu Jun 27, 2023

This test case supposedly has fixed set to 3.
Maybe the test framework is broken, but I didn't see the fixed value having any effect.
It only worked before because the callback was secretly bumping the in_progress value.

Collaborator

Good catch! However, I think this makes the test case useless. How about adding "id":123 (or whatever) to every workflow_runs item so that it still falls into the desired code path, allowing this want value to remain 3?

Contributor Author

@Langleu Langleu Jun 28, 2023

@mumoshu, I looked through the code a bit, but I don't think it makes any difference.
The fixed value influences the replicas of the RunnerDeployment, which ultimately is the scaleTarget, but none of that information is used when calculating the desired number of replicas.
The workflows would need a self-hosted label to produce a requirement of 3. Without it they're treated as 3 hosted runners, so the desired replicas equal the defined minimum because those runs aren't relevant.

All autoscaler tests run against the AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns metric and none against AutoscalingMetricTypePercentageRunnersBusy. For the latter I could imagine fixed, and therefore the scaleTarget, having an influence, but since we only look at queued and in-progress runs, the fixed value and the RunnerDeployment replicas have zero influence on the desired value.
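
A heavily simplified sketch of the difference (assumed signatures, not the actual ARC implementation): only the percentage-based metric takes the current replica count as an input, so fixed / scaleTarget cannot affect the queued-and-in-progress calculation.

```go
// Desired replicas under TotalNumberOfQueuedAndInProgressWorkflowRuns:
// only the number of matching queued and in-progress runs matters, so the
// current replica count (and therefore the fixed value) has no effect.
func desiredByQueuedAndInProgress(queued, inProgress int) int {
	return queued + inProgress
}

// Desired replicas under PercentageRunnersBusy (very rough sketch): the
// current replica count is an input here, so fixed / scaleTarget would
// actually matter for this metric.
func desiredByPercentageRunnersBusy(current, busy int, scaleUpThreshold, scaleUpFactor float64) int {
	if current == 0 {
		return 0
	}
	if float64(busy)/float64(current) > scaleUpThreshold {
		return int(float64(current) * scaleUpFactor)
	}
	return current
}
```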

Contributor Author

@mumoshu, I'd propose removing the fixed tests, since they're irrelevant for the AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns tests.
Are you okay with that change?

Contributor Author

@mumoshu, just another ping since this is stalling a bit...

Contributor Author

@nikola-jokic, done.

Collaborator

Thank you!

Contributor Author

The lint failure looks like a timeout; it works locally.
Same for the RunnerSet-related tests: this change doesn't touch that logic, and they run into a timeout as well.

Collaborator

No worries, I'll re-run the tests. I know it is not affecting that part of the code.

Collaborator

Thanks for your assistance @nikola-jokic!

Collaborator

@mumoshu mumoshu left a comment

Sorry for the delay! Apparently I only got confused by how the tests are commented.
I've corrected some comments, and this looks good overall now.
Thanks for your patience and contribution, @Langleu!
// I've been a bit out of bandwidth due to a lack of sponsorships.

@mumoshu mumoshu merged commit 2974429 into actions:master Jul 25, 2023
@Langleu Langleu deleted the remove-autoscaler-callback branch July 25, 2023 06:00
Link- pushed a commit that referenced this pull request Jul 26, 2023
Labels: community (Community contribution), needs triage (Requires review from the maintainers)

Successfully merging this pull request may close the following issue:

Pull-based autoscaling causes unrelated scale-ups due to incomplete GitHub response (#2659)

7 participants