fix: remove callbacks resulting in scales due to incomplete response #2671
Conversation
After digging through the code base trying to understand why we see random scale-ups of runners that actually don't do anything, we came across the same issue described here (thanks for opening the pull request too)! This is affecting us as well, causing extra cost for EC2 instances. It would be great to get some attention from maintainers here.
@mumoshu, @toast-gear, @nikola-jokic anyone up for a review?
Hi @Langleu! First of all, awesome job! I can admit that the removed logic was helpful only when there were a few jobs with different labels; that could have been the case 2 or 3 years ago. Today it might be worse than nothing, as many folks have started using self-hosted runners for relatively more complex use cases.
LGTM. Let me merge this, and I will try to cut a new ARC patch version this week!
Thank you for your contribution!
Hey @mumoshu, thanks for having a look at it! Any good suggestions on how to fix the tests? I played around with it a bit, and we could give each workflow run an id and mock via …
@mumoshu, I added in 45c975b how I imagine the tests to be fixed at the moment. I additionally added 2 small test cases to make sure that neither a hosted runner nor an empty Jobs array scales the runner; see the sketch below. Looking forward to getting it merged and having a new release! 🚀
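The exact contents of 45c975b aren't shown in this thread, so the following is only a sketch of what those two added cases could look like, using the table field names visible in the diff that follows; the `id` values and fixture contents are assumptions, not the actual commit:

```go
// Hypothetical table entries (struct shape inferred from the diff below;
// the real cases in 45c975b may differ).
var extraCases = []struct {
	workflowRuns             string
	workflowRuns_queued      string
	workflowRuns_in_progress string
	want                     int
}{
	// A run whose jobs array comes back empty must not scale the runner.
	{
		workflowRuns:             `{"total_count": 1, "workflow_runs":[{"id": 1, "status":"in_progress"}]}`,
		workflowRuns_queued:      `{"total_count": 0, "workflow_runs":[]}`,
		workflowRuns_in_progress: `{"total_count": 1, "workflow_runs":[{"id": 1, "status":"in_progress"}]}`,
		want:                     1, // stays at the configured minimum
	},
	// A run that only targets GitHub-hosted runners must not scale either.
	{
		workflowRuns:             `{"total_count": 1, "workflow_runs":[{"id": 2, "status":"queued"}]}`,
		workflowRuns_queued:      `{"total_count": 1, "workflow_runs":[{"id": 2, "status":"queued"}]}`,
		workflowRuns_in_progress: `{"total_count": 0, "workflow_runs":[]}`,
		want:                     1,
	},
}
```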
```diff
@@ -534,8 +584,9 @@ func TestDetermineDesiredReplicas_OrganizationalRunner(t *testing.T) {
 	workflowRuns:             `{"total_count": 4, "workflow_runs":[{"status":"in_progress"}, {"status":"in_progress"}, {"status":"in_progress"}, {"status":"completed"}]}"`,
 	workflowRuns_queued:      `{"total_count": 0, "workflow_runs":[]}"`,
 	workflowRuns_in_progress: `{"total_count": 3, "workflow_runs":[{"status":"in_progress"},{"status":"in_progress"},{"status":"in_progress"}]}"`,
-	want:                     3,
+	want:                     1,
```
This one is supposedly `fixed` at 3. Maybe the test framework is broken, but I didn't see the `fixed` value having any effect. It only worked before because the callback was secretly bumping the `in_progress` value.
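As an aside, a minimal sketch of that fallback pattern shows how an incomplete response can bump the count without any label check; the names here are invented for illustration and are not the actual ARC code:

```go
package sketch

// job is a simplified stand-in for a workflow job (hypothetical type).
type job struct {
	status string
	labels []string
}

// countInProgress mimics the removed behavior: when GitHub returns an
// empty jobs array for a run that actually has jobs, the old fallback
// counted the run as busy anyway, never checking labels. That is the
// "secret bump" of the in_progress value described above.
func countInProgress(jobs []job, runnerLabels []string) int {
	if len(jobs) == 0 {
		// Old fallback: assume the run is busy and scale regardless.
		return 1
	}
	n := 0
	for _, j := range jobs {
		if j.status == "in_progress" && jobTargetsRunner(j.labels, runnerLabels) {
			n++
		}
	}
	return n
}

// jobTargetsRunner reports whether every label the job requests is offered
// by the runner (the usual "runs-on is a subset of runner labels" rule).
func jobTargetsRunner(jobLabels, runnerLabels []string) bool {
	offered := map[string]bool{}
	for _, l := range runnerLabels {
		offered[l] = true
	}
	for _, l := range jobLabels {
		if !offered[l] {
			return false
		}
	}
	return true
}
```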
Good catch! However, I think this makes this test case useless. How about adding `"id":123` or whatever to every `workflow_runs` item so that it still falls into the desired code path, allowing this `want` value to remain 3?
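Applied to the fixture in the diff above, the suggestion could look roughly like this; the `id` values are arbitrary placeholders:

```go
workflowRuns: `{"total_count": 4, "workflow_runs":[{"id": 123, "status":"in_progress"}, {"id": 124, "status":"in_progress"}, {"id": 125, "status":"in_progress"}, {"id": 126, "status":"completed"}]}`,
```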
@mumoshu, I looked a bit through the code, but I don't think it makes any difference. The `fixed` value influences the replicas of the RunnerDeployment, and that is ultimately the `scaleTarget`, but none of that information is used to calculate the desired number of replicas. The workflows don't have a `self-hosted` label, which would be required for a result of 3; otherwise they're treated as 3 hosted runners, which means the desired replicas equal the defined minimum, since hosted runs don't matter here.

All autoscaler tests run against the `AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns` metric and none against `AutoscalingMetricTypePercentageRunnersBusy`. For the latter, I could imagine the `fixed` value, and therefore the `scaleTarget`, having an influence, but since we only check `inQueue` / `inProgress`, the `fixed` value and the replicas of the RunnerDeployment have zero influence on the `desired` value.
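For illustration, a simplified sketch of why that is; this is an invented function, not ARC's actual implementation. Under `AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns` the desired replica count is derived from the queued and in-progress counts alone, clamped to the configured bounds, so the current replica count never enters the formula:

```go
package sketch

// desiredReplicas derives the target from job counts only, clamped to
// [min, max]. Note that the RunnerDeployment's current ("fixed") replica
// count does not appear anywhere in the computation.
func desiredReplicas(inQueue, inProgress, min, max int) int {
	desired := inQueue + inProgress
	if desired < min {
		return min
	}
	if desired > max {
		return max
	}
	return desired
}
```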
@mumoshu, I'd propose removing the `fixed` tests, since they're irrelevant for the `AutoscalingMetricTypeTotalNumberOfQueuedAndInProgressWorkflowRuns` tests. Do you agree with that change?
@mumoshu, just another ping since this is stalling a bit ...
@nikola-jokic, done.
Thank you!
Lint looks like a timeout; it works locally. The same goes for the RunnerSet-related tests: this change doesn't touch that logic, and they run into a timeout as well.
No worries, I'll re-run the tests. I know it is not affecting that part of the code.
Thanks for your assistance @nikola-jokic!
Sorry for the delay! Apparently, I got confused only by how the tests are commented. I've corrected some comments, and this looks good overall now.
Thanks for your patience and contribution @Langleu!
(I've been a bit low on bandwidth due to a lack of sponsorships.)
fix: remove callbacks resulting in scales due to incomplete response (#2671)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Fixes #2659

This removes the callbacks in the `listWorkflowJobs` function, which are only used by 2 edge cases that shouldn't scale the RunnerDeployment. GitHub requests can return an empty job array for a run even though the run contains jobs. This results in unpredictable scale-ups of unrelated RunnerDeployments, since no labels are checked; the deployment is just blindly scaled.

I can also rework it into a feature flag to maintain the same functionality, but I currently don't see the use case.

The logging may be helpful for users to figure out stuck jobs: e.g., when GitHub Actions has an internal error, workflows can get stuck without users noticing. It therefore helps to cancel those jobs once and for all, since they would otherwise trigger the previous edge case.
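To make the described behavior concrete, here is a minimal sketch under stated assumptions: the function name, signature, and types are invented for illustration and are not ARC's actual `listWorkflowJobs` API. It shows the essence of the fix: an empty jobs array is logged rather than counted, and only jobs whose labels match the runner contribute to scaling.

```go
package autoscaler

import "log"

// workflowJob is a simplified stand-in for a job record (hypothetical type).
type workflowJob struct {
	status string
	labels []string
}

// countScalableJobs sketches the fixed code path. An empty jobs array from
// GitHub no longer feeds any fallback scaling; it is only logged, which
// doubles as a hint for stuck workflow runs.
func countScalableJobs(runID int64, jobs []workflowJob, runnerLabels []string) int {
	if len(jobs) == 0 {
		// Incomplete response: the run has jobs, but GitHub returned none.
		// Previously a callback counted this run anyway; now we only log.
		log.Printf("workflow run %d returned an empty jobs array; not scaling", runID)
		return 0
	}
	offered := map[string]bool{}
	for _, l := range runnerLabels {
		offered[l] = true
	}
	count := 0
	for _, j := range jobs {
		matches := true
		for _, l := range j.labels {
			if !offered[l] {
				matches = false
				break
			}
		}
		if matches && (j.status == "queued" || j.status == "in_progress") {
			count++
		}
	}
	return count
}
```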