[QG] Metrics: Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules #113

Closed
11 tasks done
Tracked by #112
tobiscr opened this issue Dec 27, 2023 · 3 comments
Assignees
Labels
area/control-plane Related to all activities around Kyma Control Plane

Comments

@tobiscr
Contributor

tobiscr commented Dec 27, 2023

Description

With #11 we are able to make the Infrastructure Manager transparent and also simplify our operational life by establishing smart metrics and alerting rules.

The goal of this task is to identify which metrics / KPIs are business relevant and what their critical thresholds are. We also have to define an action plan for when such a threshold is reached, describing the action required to bring our business back on track. Finally, alerting rules have to be configured that inform us as soon as one of the thresholds is reached.

AC:

  • Investigation: Verify how metrics are supported by Kubebuilder and how other teams are implementing them, so that we can reuse known patterns
  • Think about technical and business critical metrics / KPIs which give a clear indication of the quality and health of the Infrastructure Manager (see comment below)
    • Define the reason why this metric is relevant and what it represents.
      • Mandatory: metrics of REST client (especially egress traffic and their error rates etc.)
    • Define the thresholds (min / max etc.) which indicate a service degradation or health issue of the Infrastructure Manager. If a metric has no threshold, verify whether it is still helpful for us to measure this value.
    • Specify the action that has to be taken when a threshold is reached to return the Infrastructure Manager to a productive and healthy state
    • Present the results to the team to collect feedback from colleagues.
  • The data presented in Plutono should show the current state (our Custom Resource data)
  • Implement the identified business metrics in the Infrastructure Manager
    • Requirement from SRE: expose metrics of REST client (e.g. egress-traffic to Gardener or K8s in-cluster API) to be able to detect server-side / client-side errors.
  • Configure alerting rules which inform the team as soon as one of the thresholds is reached
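
The REST client requirement above (egress traffic and error rates per target) could be approached by wrapping the HTTP transport. The following is only a minimal sketch using the Go standard library: `egressStats`, `meteredTransport`, and `classify` are hypothetical names, the in-memory counter map stands in for a real metrics backend such as a Prometheus counter, and the target label `gardener` is an illustrative assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
)

// egressStats counts outgoing requests per target and outcome class.
// Assumption: a stdlib stand-in for a real counter metric with labels.
type egressStats struct {
	mu     sync.Mutex
	counts map[string]int
}

func newEgressStats() *egressStats {
	return &egressStats{counts: make(map[string]int)}
}

func (s *egressStats) inc(target, class string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[target+"/"+class]++
}

// Get returns the number of requests recorded for a target and class.
func (s *egressStats) Get(target, class string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.counts[target+"/"+class]
}

// classify maps a status code or transport error to a coarse outcome class,
// which separates server-side from client-side errors.
func classify(status int, err error) string {
	switch {
	case err != nil:
		return "transport_error"
	case status >= 500:
		return "server_error"
	case status >= 400:
		return "client_error"
	default:
		return "success"
	}
}

// meteredTransport wraps an http.RoundTripper and records every call.
type meteredTransport struct {
	target string
	next   http.RoundTripper
	stats  *egressStats
}

func (m *meteredTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := m.next.RoundTrip(req)
	status := 0
	if resp != nil {
		status = resp.StatusCode
	}
	m.stats.inc(m.target, classify(status, err))
	return resp, err
}

func main() {
	// Fake upstream that answers 500 on /broken and 200 otherwise.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path == "/broken" {
			w.WriteHeader(http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	stats := newEgressStats()
	client := &http.Client{Transport: &meteredTransport{
		target: "gardener", next: http.DefaultTransport, stats: stats,
	}}

	for _, path := range []string{"/ok", "/ok", "/broken"} {
		resp, err := client.Get(srv.URL + path)
		if err == nil {
			resp.Body.Close()
		}
	}
	fmt.Println("success:", stats.Get("gardener", "success"))
	fmt.Println("server_error:", stats.Get("gardener", "server_error"))
}
```

Wrapping the transport keeps the measurement independent of the calling code, so the same wrapper could in principle be placed in front of both the Gardener client and the in-cluster K8s API client. Whether the real clients expose a hook for this is not confirmed by this issue.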

Reasons

Improve operational quality and simplify on-call shifts by establishing proper metrics / KPI measurement and alerting.

Extends #11

Attachments

@tobiscr tobiscr added kind/feature Categorizes issue or PR as related to a new feature. area/control-plane Related to all activities around Kyma Control Plane labels Dec 27, 2023
@tobiscr tobiscr removed the kind/feature Categorizes issue or PR as related to a new feature. label Jun 26, 2024
@tobiscr tobiscr changed the title Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules Metrics: Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules Sep 26, 2024
@tobiscr
Contributor Author

tobiscr commented Sep 26, 2024

To begin with, we will measure only the number of non-healthy Gardener clusters:

| KPI | Description | Threshold which triggers an alert |
|---|---|---|
| Number of Gardener Clusters in non-healthy state | Counting all RuntimeCRs which are in state failed | >0 |
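
As a rough illustration of this KPI, the sketch below counts the runtimes in the failed state and applies the `>0` threshold. It is a simplified model, not the Infrastructure Manager's code: the `Runtime` struct is a hypothetical stand-in for the Runtime CR, and the state literal `"Failed"` is an assumption about how the failed state is spelled.

```go
package main

import "fmt"

// Runtime is a simplified, hypothetical stand-in for the Runtime CR.
type Runtime struct {
	Name  string
	State string
}

// stateFailed is an assumed spelling of the failed state.
const stateFailed = "Failed"

// failedThreshold mirrors the table: an alert triggers when the count is > 0.
const failedThreshold = 0

// countFailed returns how many runtimes are currently in the failed state.
func countFailed(runtimes []Runtime) int {
	n := 0
	for _, r := range runtimes {
		if r.State == stateFailed {
			n++
		}
	}
	return n
}

// alertFiring reports whether the KPI exceeds the threshold.
func alertFiring(failed int) bool {
	return failed > failedThreshold
}

func main() {
	runtimes := []Runtime{
		{Name: "rt-a", State: "Ready"},
		{Name: "rt-b", State: stateFailed},
		{Name: "rt-c", State: "Pending"},
	}
	failed := countFailed(runtimes)
	fmt.Printf("failed runtimes: %d, alert firing: %v\n", failed, alertFiring(failed))
}
```

In a real setup the count would typically be exposed as a gauge and the threshold would live in the alerting rule rather than in the controller; the comparison is inlined here only to make the logic visible.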

@koala7659
Contributor

Mockup of the dashboard idea:

Image

@koala7659
Contributor

koala7659 commented Oct 3, 2024

The following metrics are collected:

  • Runtime states, as they are updated in the sFnUpdateStatus() function
  • Unexpected stops of the FSM, when the machine stops before finishing processing via one of the following functions:
    • updateStatusAndStop()
    • stop()
    • updateStatusAndStopWithError()

Additionally, after some discussions, I will also add some Kubebuilder metrics to the dashboard that we can use for our performance tests.
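
The two metric groups listed in this comment could be recorded roughly as sketched below. This is an illustrative model only: `fsmMetrics` and its methods are hypothetical names, the maps stand in for real gauge and counter metrics, and the actual FSM functions are not reproduced here. Only their names are reused as label values.

```go
package main

import (
	"fmt"
	"sync"
)

// fsmMetrics is a hypothetical in-memory recorder for the two metric groups:
// the last known state per runtime and unexpected FSM stops per stop function.
type fsmMetrics struct {
	mu              sync.Mutex
	runtimeState    map[string]string // runtime name -> last reported state
	unexpectedStops map[string]int    // stop function name -> number of stops
}

func newFSMMetrics() *fsmMetrics {
	return &fsmMetrics{
		runtimeState:    make(map[string]string),
		unexpectedStops: make(map[string]int),
	}
}

// SetRuntimeState overwrites the state of a runtime, so each runtime is
// counted in exactly one state at a time (gauge-like behaviour).
func (m *fsmMetrics) SetRuntimeState(runtime, state string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.runtimeState[runtime] = state
}

// IncUnexpectedStop increments the counter for the given stop function.
func (m *fsmMetrics) IncUnexpectedStop(stopFn string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.unexpectedStops[stopFn]++
}

// CountInState returns how many runtimes currently report the given state.
func (m *fsmMetrics) CountInState(state string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	n := 0
	for _, s := range m.runtimeState {
		if s == state {
			n++
		}
	}
	return n
}

// UnexpectedStops returns the number of stops recorded for a stop function.
func (m *fsmMetrics) UnexpectedStops(stopFn string) int {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.unexpectedStops[stopFn]
}

func main() {
	m := newFSMMetrics()

	// A status update would report the new state of the runtime.
	m.SetRuntimeState("rt-a", "Pending")
	m.SetRuntimeState("rt-a", "Failed") // replaces the previous state
	m.SetRuntimeState("rt-b", "Ready")

	// A premature stop would be counted under the function that caused it.
	m.IncUnexpectedStop("updateStatusAndStopWithError")

	fmt.Println("runtimes in Failed:", m.CountInState("Failed"))
	fmt.Println("stops via updateStatusAndStopWithError:",
		m.UnexpectedStops("updateStatusAndStopWithError"))
}
```

Keeping the state as a per-runtime value rather than a running counter means a dashboard panel built on it reflects the current state of the Custom Resources, not their history.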

@tobiscr tobiscr changed the title Metrics: Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules [QG] Metrics: Identify and implement business critical metrics / KPIs, define an action plan and configure alerting rules Oct 4, 2024
@Disper Disper self-assigned this Oct 22, 2024
@Disper Disper closed this as completed Nov 8, 2024
Projects
None yet
Development

No branches or pull requests

4 participants