Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add metrics for jobset #614

Merged
merged 3 commits into from
Jul 29, 2024

Conversation

googs1025
Copy link
Member

@googs1025 googs1025 commented Jul 4, 2024

Fix: #613 (comment)

only add FailedTotal CompletedTotal, two metrics, If more metrics are needed, I will add them

var (
	FailedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Subsystem: constants.JobSetName,
			Name:      "jobset_failed_total",
			Help:      `The total number of jobset failed case`,
		}, []string{"jobsetName"},
	)

	CompletedTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Subsystem: constants.JobSetName,
			Name:      "jobset_completed_total",
			Help:      `The total number of jobset completed case`,
		}, []string{"jobsetName"},
	)
)

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 4, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 4, 2024
Copy link

netlify bot commented Jul 4, 2024

Deploy Preview for kubernetes-sigs-jobset ready!

Name Link
🔨 Latest commit 7ca93cb
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-jobset/deploys/66a2f15bacee500008b135c3
😎 Deploy Preview https://deploy-preview-614--kubernetes-sigs-jobset.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

@googs1025 googs1025 changed the title feat: add metrics for jobset [WIP] feat: add metrics for jobset Jul 4, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 4, 2024
@googs1025 googs1025 changed the title [WIP] feat: add metrics for jobset feat: add metrics for jobset Jul 4, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 4, 2024
pkg/constants/constants.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics.go Outdated Show resolved Hide resolved
pkg/metrics/metrics_test.go Outdated Show resolved Hide resolved
@danielvegamyhre danielvegamyhre self-assigned this Jul 4, 2024
@googs1025
Copy link
Member Author

googs1025 commented Jul 21, 2024

The following test: simulates the experiment of whether the indicators are still correct after the controller is restarted

# Installing the prometheus operator
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-76469b7f8c-5wb8x   1/1     Running   0          12h

# install ServiceMonitor, So we can see prometheus-prometheus-0 in the jobset-system namespace
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-c49wz   2/2     Running   0          14m
prometheus-prometheus-0                      2/2     Running   0          11h
# run paralleljobs-before jobset
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl apply -f paralleljobs.yaml
jobset.jobset.x-k8s.io/paralleljobs-before created

We can use prometheus to view metrics
image

# Next, we delete the controller to simulate restarting
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-4rk4h   2/2     Running   0          2m13s
prometheus-prometheus-0                      2/2     Running   0          11h
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-before   Completed                  True                    35s

root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-c49wz   2/2     Running   0          16m
prometheus-prometheus-0                      2/2     Running   0          11h
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl delete pod jobset-controller-manager-76767b599b-c49wz -njobset-system
pod "jobset-controller-manager-76767b599b-c49wz" deleted
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-4rk4h   1/2     Running   0          4s
prometheus-prometheus-0                      2/2     Running   0          11h
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get pods -njobset-system
NAME                                         READY   STATUS    RESTARTS   AGE
jobset-controller-manager-76767b599b-4rk4h   2/2     Running   0          22s
prometheus-prometheus-0                      2/2     Running   0          11h

# run paralleljobs-after jobset
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# ls
exclusive-placement.yaml  jobset-with-network.yaml  max-restarts.yaml  paralleljobs.yaml  success-policy.yaml  ttl-after-finished.yaml

root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-after                                                       8s
paralleljobs-before   Completed                  True                    2m57s
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-after                                                       14s
paralleljobs-before   Completed                  True                    3m3s
root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-after    Completed                  True                    20s
paralleljobs-before   Completed                  True                    3m9s


root@VM-0-5-ubuntu:/home/ubuntu/jobset/examples/simple# kubectl get jobset
NAME                  TERMINALSTATE   RESTARTS   COMPLETED   SUSPENDED   AGE
paralleljobs-after    Completed                  True                    20s
paralleljobs-before   Completed                  True                    3m9s

It can be seen from the indicators that when the simulation controller is restarted, the indicators still exist.
image

ps: if using kind, we can use port-forward, kubectl port-forward services/prometheus 39090:9090 --address 0.0.0.0
This allows us to access prometheus using a browser: http://<ecs public IP>:39090/graph

@googs1025
Copy link
Member Author

The yaml file used is as follows:

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: jobset-system
spec:
  serviceAccountName: prometheus1
  # ServiceMonitor 
  serviceMonitorSelector:
    #  label
    matchLabels:
      control-plane: controller-manager
  resources:
    requests:
      memory: 400Mi
  enableAdminAPI: false
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus
  namespace: jobset-system
spec:
  type: NodePort
  # kubectl port-forward services/prometheus  39090:9090 --address 0.0.0.0
  ports:
    - name: web
      nodePort: 30900
      port: 9090
      protocol: TCP
      targetPort: web
  selector:
    prometheus: prometheus

---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus1
  namespace: jobset-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus1
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - nodes/metrics
      - services
      - endpoints
      - pods
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources:
      - configmaps
    verbs: ["get"]
  - apiGroups:
      - networking.k8s.io
    resources:
      - ingresses
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus1
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus1
subjects:
  - kind: ServiceAccount
    name: prometheus1
    namespace: jobset-system

@googs1025
Copy link
Member Author

@danielvegamyhre /PTAL thanks!

@googs1025
Copy link
Member Author

In addition, I found that the docs does not have very clear steps for combining with prometheus operator. I will submit a PR to supplement it.

@danielvegamyhre
Copy link
Contributor

In addition, I found that the docs does not have very clear steps for combining with prometheus operator. I will submit a PR to supplement it.

This would be great, thanks!

@danielvegamyhre
Copy link
Contributor

@googs1025 in your example/test above, I see your query is for jobset_jobset_completed_total so I assume the format is {$SUBSYSTEM_NAME}_${METRIC_NAME}. I think maybe we should change the metric names to completed_total and failed_total, so the full metric is jobset_completed_total and jobset_failed_total, and we don't have a jobset_jobset_ prefix with the duplication. What do you think?

@googs1025
Copy link
Member Author

@googs1025 in your example/test above, I see your query is for jobset_jobset_completed_total so I assume the format is {$SUBSYSTEM_NAME}_${METRIC_NAME}. I think maybe we should change the metric names to completed_total and failed_total, so the full metric is jobset_completed_total and jobset_failed_total, and we don't have a jobset_jobset_ prefix with the duplication. What do you think?

done

@googs1025
Copy link
Member Author

ping @danielvegamyhre :)

@danielvegamyhre
Copy link
Contributor

Looks good to me. @kannon92 want to do a pass on this as well?

@kannon92
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 29, 2024
@danielvegamyhre
Copy link
Contributor

/approve

Thanks for the great work @googs1025!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielvegamyhre, googs1025

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 29, 2024
@k8s-ci-robot k8s-ci-robot merged commit 08ee737 into kubernetes-sigs:main Jul 29, 2024
13 checks passed
@danielvegamyhre danielvegamyhre mentioned this pull request Aug 19, 2024
20 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add monitoring metrics for jobset
4 participants