-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Provide scale-set listener metrics #2559
Conversation
@@ -170,3 +170,7 @@ template: | |||
# controllerServiceAccount: | |||
# namespace: arc-system | |||
# name: test-arc-gha-runner-scale-set-controller | |||
|
|||
metrics: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to expose this controller per runner-scale-set?
or we just query the controller and create a service by default of the customer enables service monitor at the controller level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to see the easiest way to set it up
I wanted to create this PR as a POC to have something working and later improve it since it is already a big one
But I'll convert this to draft, base the controller metrics on it to explore the easiest way to combine both of them and create a follow-up PR for controller metrics
charts/gha-runner-scale-set-controller/templates/manager_exported_service.yaml
Outdated
Show resolved
Hide resolved
2d2d70f
to
525437f
Compare
# To turn off metrics, specify empty strings for controllerAddr and listenerAddr | ||
metrics: | ||
controllerAddr: ":8080" | ||
listenerAddr: ":8080" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to use the fully qualified name?
Ex:
controllerManagerAddr
autoscalinglistenerAddr
if err != nil { | ||
log.Error(err, "Github Config URL is invalid", "URL", githubConfigURL) | ||
// stop reconciling on this object | ||
return ctrl.Result{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is going to trigger reconcile again, while if the URL is invalid, we can't really do anything about it, so we would be just wasting cycles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we not return and keep the reconcile loop?
basically, the publish metrics is best effort?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is right, but the failure is due to the wrong URL. We should never reach this point, the listener would fail, the autoscaling runner set should fail before this. I left it as a precaution, but we should never reach this point. If we are failing to parse GitHub URL, we have a bigger problem somewhere and just in case, this would serve as an indicator. But I don't think we should ever reach this condition
controllers/actions.github.com/ephemeralrunnerset_controller.go
Outdated
Show resolved
Hide resolved
charts/gha-runner-scale-set-controller/templates/deployment.yaml
Outdated
Show resolved
Hide resolved
8745fc2
to
3121b06
Compare
Do we want to add E2E tests to this PR or shall we have another task just for that? |
6e1417d
to
4afe34d
Compare
4afe34d
to
52a354a
Compare
Co-authored-by: Tingluo Huang <tingluohuang@github.com>
185c292
to
c46829e
Compare
TODOs:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brilliant work @nikola-jokic this is an exciting ship!
Co-authored-by: Tingluo Huang <tingluohuang@github.com> Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
@nikola-jokic any chance of getting job_queue_duration_seconds? This seems like a very good metric to monitor the health of our configuration/service quality? Awesome work ❤️ |
Hey @chlunde, thank you for your kind words! We are planning to introduce this metric in future releases. While testing we noticed a small issue when the scale set transitions from being active to idle, so we decided not to publish this metric until it is fixed |
Context
Fixes https://github.com/github/c2c-actions/issues/6816
Metrics are being introduced to the new autoscaling runner scale set mode. These metrics are separate from those published by the legacy modes.
The following is the list of metrics the controller-manager and listener pods will be emitting.
runs-on
matching the runner scale set name. Jobs are not yet assigned to the runner scale set.