Pull-based autoscaling causes unrelated scale-ups due to incomplete GitHub response #2659

Closed
7 tasks done
Langleu opened this issue Jun 9, 2023 · 4 comments · Fixed by #2671
Labels
bug (Something isn't working), needs triage (Requires review from the maintainers)

Comments

@Langleu
Contributor

Langleu commented Jun 9, 2023


Controller Version

0.27.4

Helm Chart Version

0.23.3

CertManager Version

1.21.1

Deployment Method

Kustomize

cert-manager installation

  • Yes

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions. It might also be a good idea to contract with any of the contributors and maintainers if your business is critical and you therefore need priority support)
  • I've read the release notes before submitting this issue and I'm sure it's not due to any recently introduced backward-incompatible changes
  • My actions-runner-controller version (v0.x.y) does support the feature
  • I've already upgraded ARC (including the CRDs, see charts/actions-runner-controller/docs/UPGRADING.md for details) to the latest and it didn't fix the issue
  • I've migrated to the workflow job webhook event (if you are using webhook-driven scaling)

Resource Definitions

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  labels:
    app: gha-runner-controller
  name: aws-core-2-unstable
  namespace: gha-runner
spec:
  template:
    spec:
      env:
      - name: DISABLE_RUNNER_UPDATE
        value: 'true'
      ephemeral: true
      image: summerwind/actions-runner:v2.304.0-ubuntu-22.04
      labels:
      - aws-core-2-default
      nodeSelector:
        nodegroup: core-8-unstable
      organization: myorg
      resources:
        limits:
          cpu: '1.8'
          memory: 6Gi
        requests:
          cpu: '1.8'
          memory: 6Gi
      serviceAccountName: agent
      volumeMounts:
      - mountPath: /etc/arc/hooks/job-started.d/label-podself.sh
        name: actions-runner-agent-config
        subPath: label-podself.sh
      volumes:
      - configMap:
          defaultMode: 493
          name: actions-runner-agent-config
        name: actions-runner-agent-config
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  labels:
    app: gha-runner-controller
  name: aws-core-2-unstable
  namespace: gha-runner
spec:
  maxReplicas: 20
  metrics:
  - repositoryNames:
    - repo1
    - repo2
    - repo3
    type: TotalNumberOfQueuedAndInProgressWorkflowRuns
  minReplicas: 0
  scaleDownDelaySecondsAfterScaleOut: 60
  scaleTargetRef:
    name: aws-core-2-unstable

To Reproduce

1. Use a RunnerDeployment with a HorizontalRunnerAutoscaler. Give it a label that no workflow uses, so that any scale-up is clearly unrelated.
2. Use pull-based scaling only, with the `TotalNumberOfQueuedAndInProgressWorkflowRuns` metric and a list of repositories. The more active those repositories are, the better: the controller checks every single workflow run for relevance, and that is where the unwanted behaviour is introduced.
3. Wait for an unrelated scale-up to happen. This is triggered by the GitHub API: either an edge case in the API or eventual consistency causes it to return an empty job array for a run that actually has jobs (see the sketch below for a way to check a run independently).
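
To check whether a given run really has jobs at the time of the scrape, you can query the same endpoint the controller uses. A minimal sketch using go-github (the module version, org/repo names, and run ID below are placeholders, not values from this issue; a `GITHUB_TOKEN` env var is assumed):

```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/google/go-github/v52/github" // adjust the major version to match your go.mod
	"golang.org/x/oauth2"
)

func main() {
	ctx := context.Background()
	ts := oauth2.StaticTokenSource(&oauth2.Token{AccessToken: os.Getenv("GITHUB_TOKEN")})
	client := github.NewClient(oauth2.NewClient(ctx, ts))

	// Placeholder owner/repo/run ID: use the run that triggered the unrelated scale-up.
	jobs, _, err := client.Actions.ListWorkflowJobs(ctx, "myorg", "repo1", 111111111, &github.ListWorkflowJobsOptions{})
	if err != nil {
		panic(err)
	}
	// A complete response lists the run's jobs; the buggy case is a 200 with an empty list.
	fmt.Printf("total_count=%d, jobs returned=%d\n", jobs.GetTotalCount(), len(jobs.Jobs))
	for _, j := range jobs.Jobs {
		fmt.Println(j.GetName(), j.GetStatus(), j.Labels)
	}
}
```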

Describe the bug

GitHub's API does not always return complete responses: the status is 200, but the job array is empty.
The issue happens in the listWorkflowJobs function.
Two edge cases are caught there, by checking whether runID == 0 or whether the GitHub API response contains no jobs (len(allJobs) == 0). In both cases a callback is executed, which doesn't make any sense.
Neither case should trigger the callback that increments the queued or in_progress counter, which ultimately determines the suggested desired number of runners.
As a result, unrelated RunnerDeployments get scaled up, since any workflow run can come back with an empty job array.
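
For clarity, here is a simplified, paraphrased sketch of that pattern. The names and signature are not the actual listWorkflowJobs code, just an illustration of the logic described above:

```go
package main

import "fmt"

// job is a stand-in for a workflow job as returned by the GitHub API (labels only).
type job struct{ Labels []string }

// countRun paraphrases the current behaviour: when the run ID is zero or the
// jobs array is empty, the callback fires anyway, so the run is counted as
// queued/in_progress without ever being matched against the runner's labels.
func countRun(runID int64, jobs []job, runnerLabels map[string]bool, callback func()) {
	if runID == 0 || len(jobs) == 0 {
		callback() // <- this is what scales unrelated RunnerDeployments
		return
	}
	for _, j := range jobs {
		for _, l := range j.Labels {
			if runnerLabels[l] {
				callback() // counted only when a label actually matches
				break
			}
		}
	}
}

func main() {
	queued := 0
	// A 200 response with an empty jobs array still bumps the counter:
	countRun(111111111, nil, map[string]bool{"aws-core-2-default": true}, func() { queued++ })
	fmt.Println("queued:", queued) // prints 1, even though no job matched the label
}
```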

Describe the expected behavior

If the job array in the GitHub API response is empty, we should not blindly scale RunnerDeployments, but rather ignore the run and let the next pull-based scrape check it again.
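
A sketch of the suggested handling (my assumption, not the merged change): count nothing for an incomplete response and let the next sync re-check the run.

```go
// countMatchingJobs returns how many of the run's jobs match the runner's labels.
// jobLabels holds each job's label list. If runID is 0 or no jobs were returned,
// the response is treated as incomplete: nothing is counted, so no scale-up
// happens, and the next pull-based sync re-checks the run.
func countMatchingJobs(runID int64, jobLabels [][]string, runnerLabels map[string]bool) int {
	if runID == 0 || len(jobLabels) == 0 {
		return 0 // incomplete GitHub response: ignore instead of triggering the callback
	}
	matched := 0
	for _, labels := range jobLabels {
		for _, l := range labels {
			if runnerLabels[l] {
				matched++
				break
			}
		}
	}
	return matched
}
```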

Whole Controller Logs

I added my own debugging to figure out what was going on.
I changed the repo name and the job/run IDs. I checked them: they do contain jobs, and those jobs use a GitHub-hosted `ubuntu-` label.


2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Triggered CallBack - AllJobs == 0	{"run_id": 111111111, "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Detected a job	{"labels": ["ubuntu-22.04"], "run_id": 111111111, "job_id": 111111111, "repository": "repo"}
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:12Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Detected a job	{"labels": ["ubuntu-22.04"], "run_id": 111111111, "job_id": 2222222222, "repository": "repo"}
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Checking workflows for queued
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Found the following runnerLabels	{"runner_labels": ["aws-core-2-default"], "runner_name": "aws-core-2-unstable", "repository": "repo"}
2023-06-08T12:59:13Z	INFO	horizontalrunnerautoscaler	Triggered CallBack - AllJobs == 0	{"run_id": 111111111, "repository": "repo"}
2023-06-08T12:59:16Z	DEBUG	horizontalrunnerautoscaler	Suggested desired replicas of 2 by TotalNumberOfQueuedAndInProgressWorkflowRuns	{"workflow_runs_completed": 0, "workflow_runs_in_progress": 0, "workflow_runs_queued": 2, "workflow_runs_unknown": 0, "namespace": "gha-runner-camunda", "kind": "runnerdeployment", "name": "aws-core-2-unstable", "horizontal_runner_autoscaler": "aws-core-2-unstable"}
2023-06-08T12:59:16Z	DEBUG	horizontalrunnerautoscaler	Calculated desired replicas of 2	{"horizontalrunnerautoscaler": "gha-runner-camunda/aws-core-2-unstable", "suggested": 2, "reserved": 0, "min": 0, "max": 20}
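
For reference, the "Calculated desired replicas" line appears to be a simple clamp of suggested + reserved into [minReplicas, maxReplicas]. A sketch of that arithmetic (my reading of the log, not the controller's exact code):

```go
// desiredReplicas clamps suggested+reserved into [minReplicas, maxReplicas].
// With the values from the log above (suggested=2, reserved=0, min=0, max=20)
// it yields 2, i.e. both empty-job-array runs were counted as queued.
func desiredReplicas(suggested, reserved, minReplicas, maxReplicas int) int {
	d := suggested + reserved
	if d < minReplicas {
		d = minReplicas
	}
	if d > maxReplicas {
		d = maxReplicas
	}
	return d
}
```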


Whole Runner Pod Logs

Runner Pod Logs are irrelevant since it's about the controller.

Additional Context

I found this discussion about the callback behaviour.
I checked the code and I don't see a reason to keep the callback behaviour; I would suggest simply ignoring both edge cases (allJobs == 0 and runID == 0) rather than scaling up RunnerDeployments.

@Langleu Langleu added the bug and needs triage labels on Jun 9, 2023
@github-actions
Contributor

github-actions bot commented Jun 9, 2023

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@Langleu
Contributor Author

Langleu commented Jun 9, 2023

TL;DR - My suggestion would be to remove the callback behaviour from the listWorkflowJobs function and simply ignore these two edge cases, since they shouldn't trigger a scale-up.

It would be nice to have a maintainer look this over, since I'm not sure whether the callback is needed for some undocumented reason.

@alexanderkranga

This happens for our organization too, and it causes us to spend additional money on EC2 instances that do not do any work. To help prioritize this, we also opened a support ticket with GitHub and linked this issue in it.

@marcinkubica

marcinkubica commented Jun 28, 2023

Same here. After completing e.g. a workflow with 4 jobs, right after all pods are terminated I get one extra runner provisioned, which causes GKE to scale up an extra worker node that then sits idle until it's scaled down.
