feat: add metrics for jobset #614

googs1025 · 2024-07-04T01:33:36Z

This PR adds only two metrics, FailedTotal and CompletedTotal. If more metrics are needed, I will add them.

var (
	FailedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Subsystem: constants.JobSetName,
			Name:      "jobset_failed_total",
			Help:      `The total number of failed JobSets`,
		}, []string{"jobsetName"},
	)

	CompletedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Subsystem: constants.JobSetName,
			Name:      "jobset_completed_total",
			Help:      `The total number of completed JobSets`,
		}, []string{"jobsetName"},
	)
)

netlify · 2024-07-04T01:33:54Z

✅ Deploy Preview for kubernetes-sigs-jobset ready!

Name	Link
🔨 Latest commit	`7ca93cb`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66a2f15bacee500008b135c3
😎 Deploy Preview	https://deploy-preview-614--kubernetes-sigs-jobset.netlify.app

pkg/constants/constants.go

pkg/metrics/metrics.go

pkg/metrics/metrics_test.go

googs1025 · 2024-07-21T03:51:12Z

The following test checks whether the metrics are still correct after the controller is restarted.

# Install the Prometheus Operator
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-76469b7f8c-5wb8x   1/1     Running   0          12h

# Install the ServiceMonitor, so that prometheus-prometheus-0 appears in the jobset-system namespace
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-c49wz   2/2     Running   0          14m
prometheus-prometheus-0                      2/2     Running   0          11h

# Run the paralleljobs-before JobSet
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl apply -f paralleljobs.yaml
jobset.jobset.x-k8s.io/paralleljobs-before created

We can use Prometheus to view the metrics.

# Next, delete the controller pod to simulate a restart
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-4rk4h   2/2     Running   0          2m13s
prometheus-prometheus-0                      2/2     Running   0          11h
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-before   Completed                  True                    35s

root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-c49wz   2/2     Running   0          16m
prometheus-prometheus-0                      2/2     Running   0          11h
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl delete pod jobset-controller-manager-76767b599b-c49wz -njobset-system
pod "jobset-controller-manager-76767b599b-c49wz" deleted
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-4rk4h   1/2     Running   0          4s
prometheus-prometheus-0                      2/2     Running   0          11h
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-4rk4h   2/2     Running   0          22s
prometheus-prometheus-0                      2/2     Running   0          11h

# Run the paralleljobs-after JobSet
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# ls
exclusive-placement.yaml  jobset-with-network.yaml  max-restarts.yaml  paralleljobs.yaml  success-policy.yaml  ttl-after-finished.yaml

root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-after                                                       8s
paralleljobs-before   Completed                  True                    2m57s
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-after                                                       14s
paralleljobs-before   Completed                  True                    3m3s
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-after    Completed                  True                    20s
paralleljobs-before   Completed                  True                    3m9s



The metrics show that they are still reported after the simulated controller restart.

PS: when using kind, we can port-forward the Prometheus service: kubectl port-forward services/prometheus 39090:9090 --address 0.0.0.0
This lets us open Prometheus in a browser at http://<ecs public IP>:39090/graph
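
For checking the counter without a browser, here is a minimal sketch that calls the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) through the port-forward above. The base URL `http://localhost:39090` assumes the port-forward is running on the local machine, the helper name `instantQueryURL` is made up for this example, and the metric name `jobset_completed_total` assumes the naming this PR finally settles on; substitute whatever name your build exposes.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// instantQueryURL (hypothetical helper) builds a Prometheus instant-query URL
// for the given PromQL expression against the given base address.
func instantQueryURL(base, expr string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	u.RawQuery = url.Values{"query": {expr}}.Encode()
	return u.String(), nil
}

func main() {
	// Assumption: the port-forward from the comment above is listening locally.
	target, err := instantQueryURL("http://localhost:39090", "jobset_completed_total")
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad URL:", err)
		return
	}
	fmt.Println("GET", target)

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(target)
	if err != nil {
		fmt.Fprintln(os.Stderr, "query failed (is the port-forward running?):", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	// Print the raw JSON response returned by Prometheus.
	fmt.Println(string(body))
}
```

Running this once before and once after deleting the controller pod gives a quick way to compare the reported values.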

googs1025 · 2024-07-21T03:52:53Z

The YAML file used is as follows:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: jobset-system
spec:
  serviceAccountName: prometheus1
  # Select the ServiceMonitor objects to use
  serviceMonitorSelector:
    # Match ServiceMonitors that carry this label
    matchLabels:
      control-plane: controller-manager
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: jobset-system
spec:
  type: NodePort
  # kubectl port-forward services/prometheus  39090:9090 --address 0.0.0.0
  ports:
    - name: web
      nodePort: 30900
      port: 9090
      protocol: TCP
      targetPort: web
  selector:
    prometheus: prometheus

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus1
  namespace: jobset-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus1
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus1
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus1
subjects:
  - kind: ServiceAccount
    name: prometheus1
    namespace: jobset-system

googs1025 · 2024-07-21T03:57:41Z

@danielvegamyhre /PTAL thanks!

googs1025 · 2024-07-21T04:00:29Z

In addition, I found that the docs do not have very clear steps for using JobSet with the Prometheus Operator. I will submit a PR to add them.

danielvegamyhre · 2024-07-25T22:49:42Z

> In addition, I found that the docs do not have very clear steps for using JobSet with the Prometheus Operator. I will submit a PR to add them.

This would be great, thanks!

danielvegamyhre · 2024-07-25T22:58:18Z

@googs1025 in your example/test above, I see your query is for jobset_jobset_completed_total, so I assume the format is ${SUBSYSTEM_NAME}_${METRIC_NAME}. I think we should change the metric names to completed_total and failed_total, so that the full metric names are jobset_completed_total and jobset_failed_total and we avoid the duplicated jobset_jobset_ prefix. What do you think?
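
To illustrate the naming format assumed in this comment, here is a small self-contained sketch. `buildFQName` is a hypothetical stand-in written for this example, not the Prometheus client's own code, and the subsystem value `"jobset"` is inferred from the `jobset_jobset_` prefix seen in the query rather than read from `constants.JobSetName`.

```go
package main

import (
	"fmt"
	"strings"
)

// buildFQName (hypothetical) joins the non-empty namespace and subsystem with
// the metric name using "_", following the ${SUBSYSTEM_NAME}_${METRIC_NAME}
// format assumed in the comment above.
func buildFQName(namespace, subsystem, name string) string {
	if name == "" {
		return ""
	}
	parts := make([]string, 0, 3)
	for _, p := range []string{namespace, subsystem, name} {
		if p != "" {
			parts = append(parts, p)
		}
	}
	return strings.Join(parts, "_")
}

func main() {
	const subsystem = "jobset" // assumed value of constants.JobSetName

	// Name as first written in the PR: the prefix is duplicated.
	fmt.Println(buildFQName("", subsystem, "jobset_completed_total")) // jobset_jobset_completed_total

	// Name after the proposed rename: the prefix appears once.
	fmt.Println(buildFQName("", subsystem, "completed_total")) // jobset_completed_total
}
```

Under these assumptions, dropping the `jobset_` prefix from the `Name` field is enough to get `jobset_completed_total` and `jobset_failed_total`.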

googs1025 · 2024-07-26T00:44:46Z

> @googs1025 in your example/test above, I see your query is for jobset_jobset_completed_total, so I assume the format is ${SUBSYSTEM_NAME}_${METRIC_NAME}. I think we should change the metric names to completed_total and failed_total, so that the full metric names are jobset_completed_total and jobset_failed_total and we avoid the duplicated jobset_jobset_ prefix. What do you think?

done

googs1025 · 2024-07-29T00:26:11Z

ping @danielvegamyhre :)

danielvegamyhre · 2024-07-29T18:18:01Z

Looks good to me. @kannon92 want to do a pass on this as well?

kannon92 · 2024-07-29T18:54:01Z

/lgtm

danielvegamyhre · 2024-07-29T21:02:21Z

/approve

Thanks for the great work @googs1025!

k8s-ci-robot · 2024-07-29T21:02:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre, googs1025


Needs approval from an approver in each of these files:

~~OWNERS~~ [danielvegamyhre]


k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 4, 2024

k8s-ci-robot requested review from danielvegamyhre and kannon92 July 4, 2024 01:33

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 4, 2024

googs1025 changed the title ~~feat: add metrics for jobset~~ [WIP] feat: add metrics for jobset Jul 4, 2024

k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 4, 2024

googs1025 force-pushed the add_metrics branch from 5324566 to 3d47919 Compare July 4, 2024 01:35

feat: add metrics for jobset

434f4d2

googs1025 force-pushed the add_metrics branch from 3d47919 to 434f4d2 Compare July 4, 2024 01:37

googs1025 changed the title ~~[WIP] feat: add metrics for jobset~~ feat: add metrics for jobset Jul 4, 2024

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 4, 2024

danielvegamyhre reviewed Jul 4, 2024


danielvegamyhre self-assigned this Jul 4, 2024

fix some typo and func name

4e96c74

googs1025 force-pushed the add_metrics branch from d8495da to 4e96c74 Compare July 5, 2024 07:29

danielvegamyhre reviewed Jul 11, 2024


remove the metric name prefix jobset

7ca93cb

googs1025 mentioned this pull request Jul 26, 2024

docs: added use cases for using prometheus-operator #626

Closed

googs1025 requested a review from danielvegamyhre July 27, 2024 01:30

k8s-ci-robot assigned kannon92 Jul 29, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2024

k8s-ci-robot merged commit 08ee737 into kubernetes-sigs:main Jul 29, 2024
13 checks passed

danielvegamyhre mentioned this pull request Aug 19, 2024

Release v0.6.0 #655

Closed

20 tasks
