Pull-based autoscaling causes unrelated scale-ups due to incomplete GitHub response #2659

Closed
7 tasks done
Langleu opened this issue Jun 9, 2023 · 4 comments · Fixed by #2671
Labels
bug (Something isn't working), needs triage (Requires review from the maintainers)

Comments

@Langleu
Contributor

Langleu commented Jun 9, 2023


Controller Version

0.27.4

Helm Chart Version

0.23.3

CertManager Version

1.21.1

Deployment Method

Kustomize

cert-manager installation

  • Yes

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  labels:
    app: gha-runner-controller
  name: aws-core-2-unstable
  namespace: gha-runner
spec:
  template:
    spec:
      env:
      - name: DISABLE_RUNNER_UPDATE
        value: 'true'
      ephemeral: true
      image: summerwind/actions-runner:v2.304.0-ubuntu-22.04
      labels:
      - aws-core-2-default
      nodeSelector:
        nodegroup: core-8-unstable
      organization: myorg
      resources:
        limits:
          cpu: '1.8'
          memory: 6Gi
        requests:
          cpu: '1.8'
          memory: 6Gi
      serviceAccountName: agent
      volumeMounts:
      - mountPath: /etc/arc/hooks/job-started.d/label-podself.sh
        name: actions-runner-agent-config
        subPath: label-podself.sh
      volumes:
      - configMap:
          defaultMode: 493
          name: actions-runner-agent-config
        name: actions-runner-agent-config
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  labels:
    app: gha-runner-controller
  name: aws-core-2-unstable
  namespace: gha-runner
spec:
  maxReplicas: 20
  metrics:
  - repositoryNames:
    - repo1
    - repo2
    - repo3
    type: TotalNumberOfQueuedAndInProgressWorkflowRuns
  minReplicas: 0
  scaleDownDelaySecondsAfterScaleOut: 60
  scaleTargetRef:
    name: aws-core-2-unstable

To Reproduce

1. Use a RunnerDeployment with a HorizontalRunnerAutoscaler. Give it a label that no workflow uses, so that any scale-up is clearly unrelated.
2. Use pull-based scaling only, with the `TotalNumberOfQueuedAndInProgressWorkflowRuns` metric and a list of repositories. The more active those repositories are, the better: the controller checks every single workflow run for relevance, and that is where the unwanted behaviour is introduced.
3. Wait for an unrelated scale-up to happen. This is triggered by the GitHub API: either an edge case in the API or eventual consistency causes it to return an empty job array for a run that actually has jobs (see the sketch below for a way to check a run independently).
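
To check whether a given run really has jobs at the time of the scrape, you can query the same endpoint the controller uses. A minimal sketch using go-github (the module version, org/repo names, and run ID below are placeholders, not values from this issue; a `GITHUB_TOKEN` env var is assumed):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v52/github" // adjust the major version to match your go.mod
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	// Placeholder owner/repo/run ID: use the run that triggered the unrelated scale-up.
	jobs, _, err := client.Actions.ListWorkflowJobs(ctx, "myorg", "repo1", 111111111, &github.ListWorkflowJobsOptions{})
	if err != nil {
		panic(err)
	}
	// A complete response lists the run's jobs; the buggy case is a 200 with an empty list.
	fmt.Printf("total_count=%d, jobs returned=%d\n", jobs.GetTotalCount(), len(jobs.Jobs))
	for _, j := range jobs.Jobs {
		fmt.Println(j.GetName(), j.GetStatus(), j.Labels)
	}
}
```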

Describe the bug

GitHub's API does not always return complete responses: the status is 200, but the job array is empty.
The issue happens in the listWorkflowJobs function.
Two edge cases are caught there, by checking whether runID == 0 or whether the GitHub API response contains no jobs (len(allJobs) == 0). In both cases a callback is executed, which doesn't make any sense.
Neither case should trigger the callback that increments the queued or in_progress counter, which ultimately determines the suggested desired number of runners.
As a result, unrelated RunnerDeployments get scaled up, since any workflow run can come back with an empty job array.
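
For clarity, here is a simplified, paraphrased sketch of that pattern. The names and signature are not the actual listWorkflowJobs code, just an illustration of the logic described above:

```go
package main

import "fmt"

// job is a stand-in for a workflow job as returned by the GitHub API (labels only).
type job struct{ Labels []string }

// countRun paraphrases the current behaviour: when the run ID is zero or the
// jobs array is empty, the callback fires anyway, so the run is counted as
// queued/in_progress without ever being matched against the runner's labels.
func countRun(runID int64, jobs []job, runnerLabels map[string]bool, callback func()) {
	if runID == 0 || len(jobs) == 0 {
		callback() // <- this is what scales unrelated RunnerDeployments
		return
	}
	for _, j := range jobs {
		for _, l := range j.Labels {
			if runnerLabels[l] {
				callback() // counted only when a label actually matches
				break
			}
		}
	}
}

func main() {
	queued := 0
	// A 200 response with an empty jobs array still bumps the counter:
	countRun(111111111, nil, map[string]bool{"aws-core-2-default": true}, func() { queued++ })
	fmt.Println("queued:", queued) // prints 1, even though no job matched the label
}
```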

Describe the expected behavior

If the job array in the GitHub API response is empty, we should not blindly scale RunnerDeployments, but rather ignore the run and let the next pull-based scrape check it again.
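
A sketch of the suggested handling (my assumption, not the merged change): count nothing for an incomplete response and let the next sync re-check the run.

```go
// countMatchingJobs returns how many of the run's jobs match the runner's labels.
// jobLabels holds each job's label list. If runID is 0 or no jobs were returned,
// the response is treated as incomplete: nothing is counted, so no scale-up
// happens, and the next pull-based sync re-checks the run.
func countMatchingJobs(runID int64, jobLabels [][]string, runnerLabels map[string]bool) int {
	if runID == 0 || len(jobLabels) == 0 {
		return 0 // incomplete GitHub response: ignore instead of triggering the callback
	}
	matched := 0
	for _, labels := range jobLabels {
		for _, l := range labels {
			if runnerLabels[l] {
				matched++
				break
			}
		}
	}
	return matched
}
```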

Whole Controller Logs

I added my own debugging to figure out what was going on.
I changed the repo name and the job/run IDs. I checked them: they do contain jobs, and those jobs use a GitHub-hosted `ubuntu-` label.


2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Triggered CallBack - AllJobs == 0	{"run_id": 111111111, "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Detected a job	{"labels": ["ubuntu-22.04"], "run_id": 111111111, "job_id": 111111111, "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Detected a job	{"labels": ["ubuntu-22.04"], "run_id": 111111111, "job_id": 2222222222, "repository": "repo"}
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Triggered CallBack - AllJobs == 0	{"run_id": 111111111, "repository": "repo"}
2023-06-08T12:59:16Z	DEBUG	horizontalrunnerautoscaler	Suggested desired replicas of 2 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 2, "workflow_runs_unknown": 0, "namespace": "gha-runner-camunda", "kind": "runnerdeployment", "name": "aws-core-2-unstable", "horizontal_runner_autoscaler": "aws-core-2-unstable"}
2023-06-08T12:59:16Z	DEBUG	horizontalrunnerautoscaler	Calculated desired replicas of 2	{"horizontalrunnerautoscaler": "gha-runner-camunda/aws-core-2-unstable", "suggested": 2, "reserved": 0, "min": 0, "max": 20}
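
For reference, the "Calculated desired replicas" line appears to be a simple clamp of suggested + reserved into [minReplicas, maxReplicas]. A sketch of that arithmetic (my reading of the log, not the controller's exact code):

```go
// desiredReplicas clamps suggested+reserved into [minReplicas, maxReplicas].
// With the values from the log above (suggested=2, reserved=0, min=0, max=20)
// it yields 2, i.e. both empty-job-array runs were counted as queued.
func desiredReplicas(suggested, reserved, minReplicas, maxReplicas int) int {
	d := suggested + reserved
	if d < minReplicas {
		d = minReplicas
	}
	if d > maxReplicas {
		d = maxReplicas
	}
	return d
}
```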


Whole Runner Pod Logs

Runner Pod Logs are irrelevant since it's about the controller.

Additional Context

I found this discussion about the callback behaviour.
I checked the code and I don't see a reason to keep the callback behaviour; I would suggest simply ignoring both edge cases (allJobs == 0 and runID == 0) rather than scaling up RunnerDeployments.

@Langleu Langleu added the bug and needs triage labels on Jun 9, 2023
@github-actions
Contributor

github-actions bot commented Jun 9, 2023

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@Langleu
Contributor Author

Langleu commented Jun 9, 2023

TL;DR - My suggestion would be to remove the callback behaviour from the listWorkflowJobs function and simply ignore these two edge cases, since they shouldn't trigger a scale-up.

It would be nice to have a maintainer look this over, since I'm not sure whether the callback is needed for some undocumented reason.

@alexanderkranga

This happens for our organization too, and it causes us to spend additional money on EC2 instances that do not do any work. To help prioritize this, we also opened a support ticket with GitHub and linked this issue in it.

@marcinkubica

marcinkubica commented Jun 28, 2023

Same here. After completing e.g. a workflow with 4 jobs, right after all pods are terminated I get one extra runner provisioned, which causes GKE to scale up an extra worker node that then sits idle until it's scaled down.
