I'm not using a custom entrypoint in my runner image
Controller Version
0.27.4
Helm Chart Version
0.23.3
CertManager Version
1.21.1
Deployment Method
Kustomize
cert-manager installation
Yes
Checks
This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contact any of the contributors and maintainers if your business is critical and you therefore need priority support.)
I've read the release notes before submitting this issue, and I'm sure it's not due to any recently introduced backward-incompatible changes
My actions-runner-controller version (v0.x.y) does support the feature
I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
I've migrated to the workflow job webhook event (if you're using webhook-driven scaling)
To Reproduce
1. Use a RunnerDeployment with a HorizontalRunnerAutoscaler. Use a label that no workflow uses, since we want to showcase an unrelated trigger.
2. Use only pull-based scaling via `TotalNumberOfQueuedAndInProgressWorkflowRuns`, defining some repositories. The more active those repositories are, the better: lots of GitHub Actions activity matters because the controller checks every single run for relevance, and that is what introduces the unwanted behaviour.
3. Wait for an unrelated scale-up to happen. This is triggered by the GitHub API: either an edge case in their API or eventual consistency results in an empty job array being returned for a run that actually has jobs.
Describe the bug
GitHub's API does not always return valid responses: some responses are 200s but contain no data.
The issue happens in the `listWorkflowJobs` function.
Two edge cases are caught by checking whether `runID == 0` or whether the GitHub API response contains no jobs (`len(allJobs) == 0`). In both cases a callback is executed, which doesn't make sense.
Neither case should trigger the callback that increments the `queued` or `in_progress` counter, which ultimately determines the suggested desired number of runners.
Ultimately, this results in unrelated RunnerDeployments being scaled up, since any workflow-run request can return an empty job array.
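To make the described pattern concrete, here is a condensed, hypothetical Go sketch of the flow (the function shape, `fetchJobsForRun`, and the `fallback` callback are illustrative stand-ins, not the actual ARC source):

```go
package autoscaler

import "context"

// Job stands in for the workflow-job object returned by the GitHub API.
type Job struct {
	ID     int64
	Labels []string
}

// listWorkflowJobs sketches the behaviour described above. fetchJobsForRun
// and fallback are hypothetical stand-ins for the real ARC code.
func listWorkflowJobs(
	ctx context.Context,
	runID int64,
	fetchJobsForRun func(context.Context, int64) ([]Job, error),
	fallback func(), // increments the queued/in_progress counter
) ([]Job, error) {
	if runID == 0 {
		// Edge case 1: no run ID. The callback still fires, so the run
		// is counted towards the suggested replica count anyway.
		fallback()
		return nil, nil
	}
	allJobs, err := fetchJobsForRun(ctx, runID)
	if err != nil {
		return nil, err
	}
	if len(allJobs) == 0 {
		// Edge case 2: a 200 response with an empty job array. The
		// callback fires even though no job could be matched against
		// the runner's labels.
		fallback()
	}
	return allJobs, nil
}
```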
Describe the expected behavior
If the job array in the GitHub API response is empty, we should not blindly scale RunnerDeployments, but rather ignore it and let the next pull-based scrape check again.
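A minimal sketch of that expectation, under the same hypothetical names as the snippet above (again, not the actual ARC source):

```go
package autoscaler

import "context"

// Job stands in for the workflow-job object returned by the GitHub API.
type Job struct {
	ID     int64
	Labels []string
}

// listWorkflowJobsFixed shows the suggested behaviour: treat both edge
// cases as "no usable information" and skip them, so only confirmed jobs
// can ever reach the queued/in_progress counter.
func listWorkflowJobsFixed(
	ctx context.Context,
	runID int64,
	fetchJobsForRun func(context.Context, int64) ([]Job, error),
) ([]Job, error) {
	if runID == 0 {
		// No run ID: nothing to count; the next sync will re-check.
		return nil, nil
	}
	allJobs, err := fetchJobsForRun(ctx, runID)
	if err != nil {
		return nil, err
	}
	if len(allJobs) == 0 {
		// Empty job array despite a 200 response, possibly eventual
		// consistency on GitHub's side: ignore rather than scale up.
		return nil, nil
	}
	return allJobs, nil
}
```

Skipping here is safe because the autoscaler re-scrapes on the next sync interval, so a genuinely queued run is counted as soon as GitHub actually returns its jobs.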
Whole Controller Logs
I added my own debugging to figure out what was going on.
I redacted the repository and the job/run IDs. I checked them: the runs do contain jobs, and those jobs used a GitHub-hosted `ubuntu-` label.
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Checking workflows for queued
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Found the following runnerLabels {"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Triggered CallBack - AllJobs == 0 {"run_id": 111111111, "repository": "repo"}
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Checking workflows for queued
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Found the following runnerLabels {"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Detected a job {"labels": ["ubuntu-22.04"], "run_id": 111111111, "job_id": 111111111, "repository": "repo"}
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Checking workflows for queued
2023-06-08T12:59:12Z INFO horizontalrunnerautoscaler Found the following runnerLabels {"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:13Z INFO horizontalrunnerautoscaler Detected a job {"labels": ["ubuntu-22.04"], "run_id": 111111111, "job_id": 2222222222, "repository": "repo"}
2023-06-08T12:59:13Z INFO horizontalrunnerautoscaler Checking workflows for queued
2023-06-08T12:59:13Z INFO horizontalrunnerautoscaler Found the following runnerLabels {"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:13Z INFO horizontalrunnerautoscaler Triggered CallBack - AllJobs == 0 {"run_id": 111111111, "repository": "repo"}
2023-06-08T12:59:16Z DEBUG horizontalrunnerautoscaler Suggested desired replicas of 2 by TotalNumberOfQueuedAndInProgressWorkflowRuns {"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 2, "workflow_runs_unknown": 0, "namespace": "gha-runner-camunda", "kind": "runnerdeployment", "name": "aws-core-2-unstable", "horizontal_runner_autoscaler": "aws-core-2-unstable"}
2023-06-08T12:59:16Z DEBUG horizontalrunnerautoscaler Calculated desired replicas of 2 {"horizontalrunnerautoscaler": "gha-runner-camunda/aws-core-2-unstable", "suggested": 2, "reserved": 0, "min": 0, "max": 20}
Whole Runner Pod Logs
Runner pod logs are irrelevant since this is about the controller.
Additional Context
I found this discussion about the callback behaviour.
I checked the code, and I don't see a reason for keeping the callback behaviour; I'd suggest simply ignoring both edge cases (`len(allJobs) == 0` and `runID == 0`) rather than scaling up RunnerDeployments.
TL;DR: my suggestion is to remove the callback behaviour from the `listWorkflowJobs` function and simply ignore these two edge cases, since they shouldn't trigger a scale-up (see the sketch above).
It would be nice to have a maintainer look this over, since I'm not sure whether the callback is needed for some undocumented reason.
This happens for our organization too, and it causes us to spend additional money on EC2 instances that do not do any work. To help prioritize this, we also opened a support ticket with GitHub and linked this issue in the ticket.
Same here. After completing e.g. a workflow with 4 jobs, right after all pods are terminated I get one extra runner provisioned, which causes GKE to scale up an extra worker node that then sits doing nothing until it's scaled down.