Provide scale-set listener metrics #2559

nikola-jokic · 2023-05-03T12:46:50Z

Context

Fixes https://github.com/github/c2c-actions/issues/6816

Metrics are being introduced to the new autoscaling runner scale set mode. These metrics are separate from those published by the legacy modes.

The following is the list of metrics the controller-manager and listener pods will be emitting.

| Owner | Metric | Type | Description |
| --- | --- | --- | --- |
| controller-manager | pending_ephemeral_runners | gauge | Number of ephemeral runners in a pending state. |
| controller-manager | running_ephemeral_runners | gauge | Number of ephemeral runners in a running state. |
| controller-manager | failed_ephemeral_runners | gauge | Number of ephemeral runners in a failed state. |
| listener | available_jobs | gauge | Number of jobs with `runs-on` matching the runner scale set name. Jobs are not yet assigned to the runner scale set. |
| listener | acquired_jobs | gauge | Number of jobs acquired by the runner scale set. |
| listener | assigned_jobs | gauge | Number of jobs assigned to the runner scale set. |
| listener | running_jobs | gauge | Number of jobs running (or about to be run). |
| listener | registered_runners | gauge | Number of runners registered by the runner scale set. |
| listener | busy_runners | gauge | Number of registered runners currently running a job. |
| listener | min_runners | gauge | Minimum number of runners configured for the runner scale set. |
| listener | max_runners | gauge | Maximum number of runners configured for the runner scale set. |
| listener | desired_runners | gauge | Number of runners desired (scale up / down target) by the runner scale set. |
| listener | idle_runners | gauge | Number of registered runners not running a job. |
| listener | started_jobs_total | counter | Total number of jobs started since the listener became ready (will reset on pod restart). |
| listener | completed_jobs_total | counter | Total number of jobs completed since the listener became ready (will reset on pod restart). |
| listener | job_queue_duration_seconds | histogram | Time spent waiting for workflow jobs to get assigned to the runner scale set after queueing (in seconds). |
| listener | job_startup_duration_seconds | histogram | Time spent waiting for workflow job to get started on the runner owned by the runner scale set (in seconds). |
| listener | job_execution_duration_seconds | histogram | Time spent executing workflow jobs by the runner scale set (in seconds). |
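
As a rough illustration of how the runner gauges in the table relate to each other, the sketch below derives `idle_runners` from `registered_runners` and `busy_runners`, following the table's descriptions. The `runnerStats` type and its `idle` method are hypothetical names used only for this example and are not taken from the listener's code.

```go
package main

import "fmt"

// runnerStats is a hypothetical snapshot of two listener gauges.
type runnerStats struct {
	Registered int // value of registered_runners
	Busy       int // value of busy_runners
}

// idle returns what idle_runners would report under the table's definition:
// registered runners that are not running a job. The result is clamped at
// zero in case the two inputs are briefly out of sync.
func (s runnerStats) idle() int {
	if s.Busy >= s.Registered {
		return 0
	}
	return s.Registered - s.Busy
}

func main() {
	s := runnerStats{Registered: 5, Busy: 2}
	fmt.Printf("idle_runners=%d\n", s.idle())
}
```

The clamp is an assumption of this sketch; the PR does not say how the listener handles a busy count that exceeds the registered count.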


TingluoHuang · 2023-05-04T18:58:58Z

charts/gha-runner-scale-set/values.yaml

@@ -170,3 +170,7 @@ template:
 # controllerServiceAccount:
 #   namespace: arc-system
 #   name: test-arc-gha-runner-scale-set-controller
+
+metrics:


Do we need to expose this controller per runner scale set?
Or can we just query the controller, and create a service by default if the customer enables the service monitor at the controller level?

I'll try to find the easiest way to set it up.
I wanted to create this PR as a POC to get something working and improve it later, since it is already a big one.
But I'll convert this to a draft, base the controller metrics on it to explore the easiest way to combine both, and create a follow-up PR for controller metrics ☺️


TingluoHuang · 2023-05-29T02:25:29Z

charts/gha-runner-scale-set-controller/values.yaml

+# To turn off metrics, specify empty strings for controllerAddr and listenerAddr
+metrics:
+  controllerAddr: ":8080"
+  listenerAddr: ":8080"


Do we want to use the fully qualified name?
Ex:

controllerManagerAddr

autoscalinglistenerAddr
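
The `values.yaml` comment in this thread says that an empty address turns metrics off. A minimal sketch of that convention, using only the Go standard library, could look like the following. The function name `newMetricsServer` and the placeholder handler are assumptions for illustration, not the controller's or listener's actual code.

```go
package main

import (
	"fmt"
	"net/http"
)

// newMetricsServer returns an HTTP server for the given listen address, or
// nil when addr is empty, mirroring the "empty string turns metrics off"
// convention. The server is only constructed here, never started.
func newMetricsServer(addr string) *http.Server {
	if addr == "" {
		return nil
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		// Placeholder body; a real handler would write collected metrics.
		fmt.Fprintln(w, "# metrics placeholder")
	})
	return &http.Server{Addr: addr, Handler: mux}
}

func main() {
	for _, addr := range []string{":8080", ""} {
		if srv := newMetricsServer(addr); srv != nil {
			fmt.Printf("metrics enabled on %q\n", srv.Addr)
		} else {
			fmt.Println("metrics disabled")
		}
	}
}
```

Returning nil for the disabled case lets the caller skip the listen step entirely, whichever key names the chart ends up using.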


TingluoHuang · 2023-05-29T02:29:28Z

controllers/actions.github.com/ephemeralrunnerset_controller.go

+		if err != nil {
+			log.Error(err, "Github Config URL is invalid", "URL", githubConfigURL)
+			// stop reconciling on this object
+			return ctrl.Result{}, nil


Return an error?

That would trigger a reconcile again. If the URL is invalid, there is nothing we can do about it, so we would just be wasting cycles.

Should we not return, and keep the reconcile loop going?
Basically, publishing metrics is best effort?

That is right, but the failure here is due to an invalid URL. We should never reach this point: the listener would fail, and the autoscaling runner set should fail, before we get this far. I left the check in as a precaution. If we fail to parse the GitHub URL, we have a bigger problem somewhere, and this would serve as an indicator.
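
The trade-off discussed in this thread (return an error and get retried, or log and stop) can be sketched with the standard library alone. In the sketch below, `result` stands in for controller-runtime's `ctrl.Result`, and `parseConfigURL` and `reconcile` are hypothetical helpers; the real controller uses its own URL parsing, which is not shown in this PR excerpt.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
)

// result is a stand-in for controller-runtime's ctrl.Result.
type result struct{ Requeue bool }

// errInvalidConfigURL marks a failure that retrying cannot fix.
var errInvalidConfigURL = errors.New("invalid GitHub config URL")

// parseConfigURL is a hypothetical validator: it accepts only absolute
// http(s) URLs that have a host.
func parseConfigURL(raw string) (*url.URL, error) {
	u, err := url.Parse(raw)
	if err != nil || u.Host == "" || (u.Scheme != "https" && u.Scheme != "http") {
		return nil, errInvalidConfigURL
	}
	return u, nil
}

// reconcile returns a nil error for the invalid-URL case, so a work queue
// would not retry an object that cannot succeed until its spec changes.
func reconcile(rawURL string) (result, error) {
	if _, err := parseConfigURL(rawURL); err != nil {
		fmt.Printf("config URL is invalid, not requeueing: %q\n", rawURL)
		return result{}, nil
	}
	// ... publish metrics for the object here ...
	return result{}, nil
}

func main() {
	for _, raw := range []string{"https://github.com/org/repo", "://bad"} {
		res, err := reconcile(raw)
		fmt.Printf("url=%q requeue=%v err=%v\n", raw, res.Requeue, err)
	}
}
```

Returning the error instead would normally cause the object to be requeued with backoff, which is the "wasting cycles" concern raised in the reply.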


Link- · 2023-06-13T12:28:22Z

Do we want to add E2E tests to this PR or shall we have another task just for that?


Co-authored-by: Tingluo Huang <tingluohuang@github.com>


Link- · 2023-08-18T13:47:19Z

TODOs:

- gha_registered_runners continues to report on the last value the listener receives
  - This requires a server side change
- Revisit the labels and make sure the common labels are applied
- Expose number of listeners gauge with all the common labels applied

Link-

Brilliant work @nikola-jokic this is an exciting ship!

Co-authored-by: Tingluo Huang <tingluohuang@github.com> Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>

chlunde · 2023-10-30T11:53:57Z

@nikola-jokic any chance of getting job_queue_duration_seconds? This seems like a very good metric for monitoring the health of our configuration and our service quality. Awesome work ❤️

nikola-jokic · 2023-11-03T10:01:05Z

Hey @chlunde, thank you for your kind words! We are planning to introduce this metric in future releases. While testing we noticed a small issue when the scale set transitions from being active to idle, so we decided not to publish this metric until it is fixed ☺️

nikola-jokic requested review from mumoshu, toast-gear and a team as code owners May 3, 2023 12:46

TingluoHuang reviewed May 4, 2023
controllers/actions.github.com/resourcebuilder.go

TingluoHuang reviewed May 4, 2023
cmd/githubrunnerscalesetlistener/autoScalerService.go

TingluoHuang reviewed May 4, 2023
cmd/githubrunnerscalesetlistener/autoScalerService.go

TingluoHuang reviewed May 4, 2023

TingluoHuang reviewed May 8, 2023
charts/gha-runner-scale-set-controller/templates/manager_exported_service.yaml

nikola-jokic marked this pull request as draft May 8, 2023 12:15

Link- added the gha-runner-scale-set label (Related to the gha-runner-scale-set mode) May 9, 2023

nikola-jokic force-pushed the nikola-jokic/metrics branch 2 times, most recently from 2d2d70f to 525437f May 23, 2023 10:15

nikola-jokic commented May 23, 2023
cmd/githubrunnerscalesetlistener/metrics.go

nikola-jokic commented May 23, 2023
cmd/githubrunnerscalesetlistener/metrics.go

nikola-jokic marked this pull request as ready for review May 23, 2023 10:19

TingluoHuang reviewed May 29, 2023
main.go

TingluoHuang reviewed May 29, 2023
controllers/actions.github.com/ephemeralrunnerset_controller.go

TingluoHuang reviewed May 29, 2023
controllers/actions.github.com/metrics/metrics.go

TingluoHuang reviewed May 30, 2023
charts/gha-runner-scale-set-controller/templates/deployment.yaml