[Github Webhook HRA] Not able to get it working... #377
I had the same issue; the HorizontalRunnerAutoscaler still requires one of the "normal" scaling types under `metrics`. This works for me:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-autoscaler
  namespace: actions
spec:
  minReplicas: 1
  maxReplicas: 3
  scaleTargetRef:
    name: runner
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '1'
    scaleDownThreshold: '0.5'
    scaleUpAdjustment: 1
    scaleDownAdjustment: 1
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"
```
Facing the same issue: I get the error below.
@robwhitby Your solution is working, thanks a lot! :D Maybe it should be added to the README.md, or made more explicit? Like this:

PS: has anyone here found a different solution to scale without the GitHub webhook? I was wondering if the controller can be
@theobolo I've not tried it, but the

EDIT: actually, I think I misread it; I don't think you can run more than one controller. The only options you really have are:
@callum-tait-pbx Hello, thanks for your answer. I'm already aware of my options, but I was wondering if I was missing something. I'm already using

You can't run multiple controllers on the same cluster, since they are checking "cluster-wide" resources like

I was asking @mumoshu whether he thinks adding namespacing capabilities to the controller sounds doable / realistic / useful? x) I was thinking about running the controller with a

For the moment, unfortunately, it's not possible. My targets in terms of scaling are:

The only solution to achieve that is by doing what I've described. The sweet spot for my controllers to achieve this is to not go over 20 pods per HRA. Running one hour straight of tests (each test run takes 5 or 6 minutes), I only hit GitHub API rate limiting at the 56th or 57th minute, which is great. The drawback of that method is that you need multiples of everything... clusters / controllers / GitHub Apps / etc.

This afternoon I tested the GitHub webhook HRA, but it still doesn't fit my case perfectly, because it reacts to

The problem with that: when you have matrix jobs with dynamic provisioning, using a pre-job to populate the matrix, the GitHub webhook HRA is a little bit flaky, because since I'm running only one "Actions" workflow, it only scales by one runner. Yes, I know... I can modify the HRA to say "hey, scale 2 or 3 or even more runners on a

BTW, I'm very thankful for all the work done on this controller; it's working better and better every day. <3
@theobolo Thanks for your detailed report! I'm still reading all the comments carefully, so let me reply to just one of them right now.

I think it's still undocumented, but the webhook-based autoscaler has a flag for configuring the namespace to be watched by the controller. We can probably add a similar flag to the controller-manager easily. I'll give it a shot soon.
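For reference, a sketch of how such a namespace restriction could look once available. The `scope.*` Helm values below are assumptions based on later chart versions, not something documented in this thread; check your chart version's values for the actual names:

```yaml
# values.yaml fragment for the actions-runner-controller Helm chart.
# NOTE: the scope.* keys are assumptions, not confirmed in this thread.
scope:
  # Restrict the controller to a single namespace instead of
  # watching cluster-wide resources.
  singleNamespace: true
  # The namespace this controller instance should watch.
  watchNamespace: actions
```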
@theobolo Does

I was wondering if your workflow looks like a matrix build occasionally seen in CI. If so, you could add a scaleUpTrigger of
See #380 for the namespace restriction feature.
@mumoshu Is metrics scaling required when we're using GitHub webhook scaling?
@avdhoot No. But, omitting

Also,
@mumoshu If we set
@avdhoot Thanks! Seems like a bug; I'll fix it ASAP. Also, the line numbers in your logs say that you're not using
@avdhoot Could you share your manifest YAML for the HRA and runner deployment? You seem to be missing

Not saying your configuration is wrong; just trying to fully understand your use-case.
@avdhoot The gotcha here is that you need to define either

If you're trying to omit
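To make the requirement concrete, here is a minimal sketch of an HRA that keeps `minReplicas`/`maxReplicas` defined alongside a webhook scale-up trigger. Names like `example-runnerdeploy` are placeholders, not taken from this thread:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler
spec:
  # min/max replicas were still required at the time of this thread,
  # even when scaling purely from webhook events.
  minReplicas: 1
  maxReplicas: 10
  scaleTargetRef:
    name: example-runnerdeploy
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"
```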
Right. I am omitting
@avdhoot The fix should be available in the current
@mumoshu Yes, it is working... Thanks 🙏
@mumoshu I just switched to webhook-based scaling and I am using the

config:
@mumoshu I think it is nothing to worry about. I checked the delivery ID and it was
@mumoshu Thanks a lot for your quick updates! That's fantastic x) I'm going to test the namespace flag this week and try to wrap up my multiple-cluster setup within a single one. I'll let you know if I have any issues with it!

BTW, you're absolutely right: my Actions workflows are dynamic matrixes. Since we have a big mono-repo using lerna, we use a preliminary job on each pull request's tests to detect which packages were modified with

With the result of this first job, we "populate" the matrix steps of the next job, using the output of the first one. For example, sometimes unit tests run on only 2 or 3 packages; sometimes there could be more than 20 or 30 tests for each pull request. Actually, each pull request has 5 different matrix Actions pipelines to run: Unit Tests / Lint Tests / Build Tests / QA Tests / Integration Tests.

Anyway, I'll try to find a good setup first using mixed

But isn't there a way to just have something like:

So I'd be able to scale one runner each time a new "job" is triggered, and not only when a whole "actions" workflow is triggered.
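The dynamic-matrix pattern described above is commonly expressed like this in a GitHub Actions workflow. This is a sketch; the job names, the detection step, and the output shape are hypothetical, not taken from @theobolo's setup:

```yaml
name: tests
on: pull_request
jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      packages: ${{ steps.changed.outputs.packages }}
    steps:
      - uses: actions/checkout@v2
      - id: changed
        # Hypothetical step: emit a JSON array of changed packages,
        # e.g. derived from `lerna changed`.
        run: echo "::set-output name=packages::[\"pkg-a\",\"pkg-b\"]"
  unit-tests:
    needs: detect-changes
    runs-on: self-hosted
    strategy:
      matrix:
        package: ${{ fromJSON(needs.detect-changes.outputs.packages) }}
    steps:
      - run: echo "testing ${{ matrix.package }}"
```

Each matrix entry becomes a separate job, which is why per-job (rather than per-workflow) scale-up triggers matter for this kind of workflow.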
Great feedback! I'll do it soon.
@theobolo I was under the impression that

In other words, can we say that there would be no problem if it works with a sync period of 10 seconds (or 1 second)?

I wasn't sure what you meant by "workaround" here. To me, tweaking min/max replicas, scale-up triggers, and metrics to keep the controller bringing up a sufficient number of runners for your use-case just seemed like the right path. Probably your goal is to free yourself from tweaking those settings where possible, perhaps by utilizing some domain knowledge (that can't be obtained from the GitHub API)? What if we added a CLI application that can be executed in your Actions workflow, like
@mumoshu I'm going to answer your comment soon. By the way, I'm not able to scale anymore using the canary image since your 728829b merge. It seems that the controller does not see any registered runner when the HRA check is done:

You can see that the HRA doesn't report any registered runner. My setup right now: master Helm chart (v0.8.0), and these HRA / RunnerDeployment manifests:
@theobolo Thanks for the feedback! That's interesting... 728829b should not affect
@mumoshu Yes, I'm really not sure what's going on, but since yesterday's commits on the canary version something seems broken. Here is the result of

There is nothing suspect; I have just one runner running and it never scales. No errors on the controller. And maybe this will help: here is my values.yaml:

I'm currently testing the "namespace scope" feature.

PS: I've seen that I have an old

I'm going to recreate my cluster, to start from scratch with the latest charts / images / features. I'll update my comment in 30 min.

Edit: after recreating the cluster from scratch, with the master ref for the Helm charts, canary images, and the RunnerDeployment & HRA linked in this comment, I have the same problem. So it's not related to stale/old resources. 👍
@theobolo I reviewed your values.yaml and it looks mostly good. One thing I'm suspicious about is this part:

It seems like the chart is configured to create a secret with empty GitHub creds. If the chart worked as told, it should create an invalid secret and the controller should log authentication failures against the GitHub API. But it has not? Would you mind verifying the content of the secret?
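One way to verify the secret's content is to decode it with kubectl. This is a sketch; the secret name `controller-manager`, the namespace, and the `github_token` key are assumptions, so substitute the ones from your chart release:

```shell
# Inspect the secret holding the controller's GitHub credentials.
# Secret name, namespace, and key below are assumptions; adjust
# them to match your Helm release.
kubectl get secret controller-manager \
  --namespace actions-runner-system \
  --output jsonpath='{.data.github_token}' | base64 --decode
```

If the output is empty, the chart created the secret without credentials, which would explain the controller silently doing nothing.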
@mumoshu Oh! I'm really sorry, I had just stripped my tokens from the

To be clear, that's not the problem. I'm still trying to figure out why it stopped working.

Edit: I changed the actions-controller image to the v0.17.0 version, and the HRA is now working properly, so something is broken in the latest canary versions.
@mumoshu Wonderful, I'll give it a try as soon as it's merged ;) Thanks a lot!
PercentageRunnerBusy seems to have regressed since #355, due to RunnerDeployment.Spec.Selector being empty by default and the HRA controller using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with an empty selector. Ref #377 (comment)
@mumoshu It's working again :D! Thanks a lot, let's test now 🚀
… enabled (#391) Relates to #379 (comment). Relates to #377 (comment). When you define HRA.Spec.ScaleUpTriggers[] but omit HRA.Spec.Metrics[], the HRA controller will now enable ScaleUpTriggers alone, instead of automatically enabling TotalNumberOfQueuedAndInProgressWorkflowRuns. This allows you to use ScaleUpTriggers alone, so that autoscaling is done without calling the GitHub API at all, which should greatly decrease the chance of GitHub API calls getting rate-limited.
I've come to think this is confusing, and there's no actual benefit in making it the default behavior. I've changed the controller code, and since #391, omitting
@mumoshu Once this is wrapped up, could we get a formal release of the controller? I'd quite like to upgrade our setup at work, and would rather version-pin it than run the canary image.
@callum-tait-pbx I'd definitely like to do so! Currently I have only two remaining potential issues to investigate, so we'll probably have the next release after that.
@mumoshu Hello Yusuke, I've heavily tested the new

I can say it's working very well ;) I launched one cluster with 5 namespaced controllers, each one in charge of 20 runners, with a sync period of 1m. And... it's amazing, that's it :D Nothing more to say except thank you a lot, again.

By the way, I'm going to test the GitHub webhook HRA a little bit more on my side to find the best autoscaling mechanism for my case. I think you're right, the

Now I'm able to scale up to 100 runners constantly without any GitHub API limitation, with one cluster! 🥇
This looks awesome. @mumoshu the
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hello everyone,

Since the scaling problems I described in issue #206, I've found an efficient workaround...
I'm using multiple Kubernetes clusters (5, actually) with one GitHub Actions controller deployed on each one, scaling with the `PercentageRunnersBusy` method. It's working well, and it was the only solution I found to be able to run 100 runners concurrently with the actions-runner-controller.
@mumoshu
BTW, that's not why I'm here today. Since I've seen the new GitHub webhook HRA feature, I absolutely need it, to stop doing this kind of workaround and to be able to use the controller "at scale".
Unfortunately, I'm not able to get it working using the latest Helm chart, version 0.7.0.
I tried the `latest` / `v0.17.0` / `canary` versions of the controller image, and I'm using the `master` branch CRDs. When I declare the HRA like this:

The github-actions-controller crashes with this log:
I tried to delete:

to follow the README.md example, but the controller is not happy either, and keeps saying to add `minReplicas` and `maxReplicas` to work. I know that this feature is at an early stage, so I won't be surprised if it's not working yet; I just wanted to be sure that you are aware of this :D