Proposed OLM-based Monitoring Stack Solution for RH Managed Services and future needs. #866

Merged

Conversation

@bwplotka (Contributor) commented Aug 12, 2021

As a result of the collaboration between the Monitoring Team and the Service Delivery org in our Monitoring Enhancement Working Group, @fpetkovski and I, on behalf of our team, would like to propose a solution for the immediate needs of Managed Services (SaaS/Layered Services/Addons).

Feedback is very welcome!

PTAL @alanmoran @smarterclayton @kbsingh @pb82 @simonpasquier @brampling

Please keep all offline discussions in this PR so everyone is up-to-date.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

…and future needs.

As a result of the collaboration between the Monitoring Team and the Service Delivery org in our Monitoring Enhancement Working Group, @fpetkovski and I would like to propose a solution for the immediate needs of Managed Services (SaaS/Layered Services/Addons).

Feedback is very welcome! Please keep all offline discussions in this PR so everyone is up-to-date.

Signed-by: Bartlomiej Plotka <bwplotka@gmail.com>
fpetkovski and others added 4 commits August 12, 2021 15:49
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka (Contributor, Author) commented Aug 14, 2021

Got a bit upset by the linter, so I filed issue #869 (: (with some ideas on how to improve it)

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka (Contributor, Author) commented:
/test markdownlint

@simonpasquier (Contributor) left a comment

This is very good! Thanks for the effort.

(7 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)

For simplicity reasons, routing alerts that come from managed service monitoring stacks will be handled by a single Alertmanager.
We can take advantage of the recently added <code>[AlertmanagerConfig](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#alertmanagerconfig)</code> and allow MTO/MTSRE to create their own routing configuration in a centrally deployed Alertmanager instance.
Finally, the MTSRE team could centrally configure receivers in one place for all managed services and service owners would simply use those receivers in their routing rules. It is worth noting that the last feature depends on [https://github.com/prometheus-operator/prometheus-operator/issues/4033](https://github.com/prometheus-operator/prometheus-operator/issues/4033).
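For illustration only (this sketch is not part of the quoted enhancement text), a per-tenant routing configuration using the Prometheus Operator's AlertmanagerConfig CRD might look roughly like this; the namespace, matcher, receiver, and Secret names are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-addon-routing          # hypothetical name
  namespace: example-addon-namespace   # hypothetical tenant namespace
spec:
  route:
    receiver: addon-pagerduty
    groupBy: ["alertname", "namespace"]
    matchers:
      - name: service
        value: example-addon
  receivers:
    - name: addon-pagerduty
      pagerdutyConfigs:
        - serviceKey:
            name: pagerduty-credentials  # hypothetical Secret in the same namespace
            key: serviceKey
```

Note that the Prometheus Operator scopes an AlertmanagerConfig to the namespace it is created in, which is what the scoping question further down this thread refers to.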
Contributor:

I don't see this issue as a blocker since it's already possible to configure the global Alertmanager configuration with a secret. Having said that, supporting the same structures for global and per-namespace Alertmanager configurations would be nice indeed.

Contributor (Author):

👍🏽

Contributor:

@simonpasquier is it possible to use both the secret and an AlertmanagerConfig? I thought that with the AlertmanagerConfig everything gets scoped to the namespace from which the config is coming. If that is the case, how would a tenant reference a receiver from the common secret in their AlertmanagerConfig?

(resolved review thread on enhancements/monitoring/monitoring-stack-operator.md — outdated)
@aditya-konarde (Contributor) left a comment

Thanks a lot for working on this! ❤️

Added some suggestions and clarifying questions :)

(10 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)
@matej-g (Contributor) left a comment

Solid chunk of work 💪

Mostly minor suggestions from my side

(5 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)
@jan--f (Contributor) left a comment

Nice, this sounds really good! Left a few comments/suggestions inline.

I'm not sure whether the following is out of scope for this document, but should we consider maintenance of this somewhere, maybe in the risks section?
I assume the Monitoring team will maintain this. Adding an additional operator to the team's maintenance workload seems worth mentioning.

(5 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)
openshift-ci bot removed the lgtm label Sep 16, 2021.
@simonpasquier (Contributor) commented:
/lgtm

openshift-ci bot added the lgtm label Sep 24, 2021.
@jeremyeder (Contributor) commented:
@cooktheryan our robot overlords have you set as the next reviewer. Is that correct?

@cooktheryan commented:
@jeremyeder I wish that were the case, but I don't believe I am the correct individual to be making that decision. This should be tagged for someone else.

@sugarraysam left a comment

Hello, I am part of the MTSRE team and would like to thank you for putting this proposal together. This is awesome, and I am looking forward to working together to make this new project happen.

I posted a few comments/suggestions, and they all have the same two underlying goals:

  1. Expose an SLO based interface to our tenants
  2. Make sure we perform active reconciliation on this interface (operator-pattern)

This is very important, as the end goal here is to be able to measure SLOs/SLIs and error budgets. We should not ask our MTO to tweak or touch any Prometheus/Alertmanager/Grafana configs. They need to spend their time doing feature development and bug fixing.

There exist various tools out there that could help us abstract away Prometheus/Alertmanager and Grafana configs:


#### Story 1

As a MTSRE/MTO, I would like to define ServiceMonitors that will scrape metrics from my managed service by a Prometheus-compatible system.
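As a hedged illustration of this story (not taken from the enhancement itself), a minimal ServiceMonitor for a managed service could look roughly as follows; all names, labels, and the port are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-addon-metrics          # hypothetical
  namespace: example-addon-namespace   # hypothetical
  labels:
    app.kubernetes.io/part-of: example-addon
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-addon
  endpoints:
    - port: metrics    # name of the Service port exposing /metrics
      interval: 30s
      path: /metrics
```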
@sugarraysam commented Oct 5, 2021

I want to challenge the interface we want to expose. Instead of ServiceMonitor, it would be much better to provide an SLO-based interface. The reasoning is that we want to abstract away as much of the monitoring stack as we can. We want MTO to focus on SLOs and SLIs, not on becoming Prometheus experts. The story I suggest is this one:

As an MTSRE/MTO, I would like to define an SLO-based CR that generates Prometheus-compatible configuration to scrape metrics from my managed service with a Prometheus-compatible system.

Example implementation:
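(The comment's original example is not reproduced here. As a hedged sketch only, an SLO-based CR of the kind proposed, modeled on sloth's PrometheusServiceLevel API referenced later in this thread, could look roughly like this; the service name, queries, and labels are hypothetical, and field names should be checked against sloth's documentation:)

```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: example-addon-slos           # hypothetical
  namespace: example-addon-namespace # hypothetical
spec:
  service: example-addon
  slos:
    - name: requests-availability
      objective: 99.5                # target: 99.5% of requests succeed
      description: Example availability SLO for the addon's HTTP API.
      sli:
        events:
          # Error and total event queries; {{.window}} is templated by the controller.
          errorQuery: sum(rate(http_requests_total{job="example-addon",code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{job="example-addon"}[{{.window}}]))
      alerting:
        name: ExampleAddonHighErrorRate
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
```

From such a CR, the controller would generate the underlying recording and alerting rules, so tenants never edit Prometheus configuration directly.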


#### Story 3

As an MTSRE/MTO, I would like to create alerting and recording rules for Prometheus metrics exposed by my managed service.
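As a hedged illustration of this story (again, not from the enhancement itself), the existing Prometheus Operator API for it is the PrometheusRule CR; the metric names, expressions, and threshold below are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-addon-rules          # hypothetical
  namespace: example-addon-namespace # hypothetical
spec:
  groups:
    - name: example-addon.rules
      rules:
        # Recording rule: pre-compute the request error ratio over 5 minutes.
        - record: example_addon:http_requests:error_ratio5m
          expr: |
            sum(rate(http_requests_total{job="example-addon",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="example-addon"}[5m]))
        # Alerting rule: fire when the error ratio stays above 5% for 10 minutes.
        - alert: ExampleAddonHighErrorRate
          expr: example_addon:http_requests:error_ratio5m > 0.05
          for: 10m
          labels:
            severity: warning
```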
@sugarraysam commented Oct 5, 2021

Again, proposing to abstract alerting and recording rules away in favor of a PrometheusServiceLevel CR or something similar (sloth):

As an MTSRE/MTO, I would like to create SLO-oriented CRs that will translate into Prometheus alerting and recording rules for my managed service.

Example implementation:


#### Story 4

As an MTSRE, I would like to configure routing for alerts defined on Prometheus metrics exposed by my SaaS/Addon.
@sugarraysam commented Oct 5, 2021

Again, proposing to abstract routing away in favor of a PrometheusServiceLevel CR or something similar (sloth):

As an MTSRE/MTO, I would like to create SLO-oriented CRs that will configure alerts defined on Prometheus metrics exposed by my SaaS/Addon.

Example implementation:


#### Story 6

As an MTO, I would like to create dashboards for certain metrics across all clusters where the SaaS/Addon is deployed or used.
@sugarraysam commented Oct 5, 2021

I think SLO dashboards can be abstracted as well. Creating Grafana dashboards is extremely time-consuming, and as an MTSRE engineer, I want to have a homogeneous environment when switching from one addon to another. Plus, we do not want MTO to become Grafana dashboard experts or spend any time on this. This is why I would highly suggest:

As an MTO, I would like to provide an SLO-based CR that automatically creates dashboards allowing me to visualize SLOs/SLIs and error budgets for certain metrics across all clusters where the SaaS/Addon is deployed or used.

Example implementation:
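(The original example is likewise not reproduced. As a rough, hedged sketch only, assuming the community grafana-operator and its GrafanaDashboard CRD were used for declarative dashboard delivery, a generated SLO dashboard could be shipped along these lines; the names and the dashboard JSON are placeholders:)

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: example-addon-slo-dashboard  # hypothetical
  namespace: example-addon-namespace # hypothetical
  labels:
    app: grafana
spec:
  json: |
    {
      "title": "example-addon SLOs",
      "panels": []
    }
```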

#### Story 7

As a Customer, I can use UWM without seeing Addons metrics, alerting or recording configurations.


Adding an extra requirement for MTSRE/MTO:

#### Story 8

As an MTO, I would like to provide an SLO-based CR that is actively reconciled by a Kubernetes-based operator, so that I can update my SLO/SLI configurations smoothly.

Example implementation:

@fpetkovski (Contributor) commented:
Hi @sugarraysam, thanks for the suggestions. Having a higher-level abstraction on top of ServiceMonitors and Dashboards is an awesome idea. Since it comes a bit late in the proposal process and we've already started work on this project, could you create an issue with your suggestions instead? This is the repository for the project: https://github.com/rhobs/monitoring-stack-operator

@nb-ohad commented Oct 6, 2021

@fpetkovski @sugarraysam
I think a higher-level abstraction is a nice idea on paper, but the reality is that the SRE team, or even the tenants' eng teams, are not always the ones who define the monitoring resources.

In the cases I have seen so far, including our own case (ODF MS), the underlying product (that is backing the service) is the one responsible for deploying/managing/reconciling the monitoring resources (ServiceMonitors, PodMonitors, and PrometheusRules).

If we go with the proposed abstraction as the only way to configure scraping, we will create a situation where these kinds of tenants have to "jump through many hoops" and introduce some ugly workarounds just to configure scraping.

If we look at the big picture, taking into account all the deployed components (from the infrastructure to the products themselves), we will see that all these workarounds just create a less stable solution.

@fpetkovski (Contributor) commented:
I think we don't have to make a mutually exclusive decision. We can still support all the existing APIs and add a few new ones on top.

@sugarraysam commented Oct 6, 2021

@nb-ohad @fpetkovski I like the idea of exposing multiple interfaces; it makes things flexible and allows for a transition period. I understand it might take some work to migrate the existing monitoring to an SLO-based approach, but in the long run this is what we should aim for. Monitoring without SLOs does not make much sense. Why are you alerting on X or Y? How does it relate to your users? CPU usage and memory consumption without context are not good measures. The whole industry is shifting towards SLO-based monitoring.

Will follow up with a feature request (issue) -> https://github.com/rhobs/monitoring-stack-operator

(edit: this is what app-sre is doing in app-interface as well /schemas/app-sre/slo-document-1.yml)

@nb-ohad commented Oct 6, 2021

@sugarraysam I get your point, but your proposal narrows the solution down to a single use case, which is facilitating service SLO tracking and visibility. There are other use cases for monitoring a managed service, for example providing product-defined dashboards / OCP console plugins to the customer (as ODF is doing). The metrics needed to accomplish that are unrelated to the SLOs, the SLIs, or even the serviceability of the service.

That said, I don't see the proposed interface as a replacement for the "older" interface, but perhaps as an enhancement. This means we should not set any expectation that a transition will happen for some offerings.

openshift-ci bot removed the lgtm label Oct 7, 2021.
@bwplotka (Contributor, Author) commented Oct 7, 2021

I updated the enhancement with the following changes:

  • Added Graduation criteria.
  • Added a story for exposing metrics to the Console and how to do it.
  • Added a mention of further steps around soft-tenancy potential in later iterations.
  • Fixed typos.

As for your comments, @sugarraysam, thanks for your idea. It is being discussed in rhobs/observability-operator#13, and I think it's a solid thing to implement on top of ServiceMonitors. I don't think we can remove everything else that is not SLO-based. Monitoring is more than SLOs; it's also about troubleshooting, sometimes billing, data analysis of features, etc. So I would not trim everything down to SLO usage only, even if, as MTSRE, you have the right to ignore anything else. Does it make sense? (:

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

#### Tech Preview -> GA

TBD
* More options exposed in Monitoring Stack resource
* We have onboard at lease few Addons on this feature.
Contributor:

"at lease" -> "at least"

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@simonpasquier (Contributor) left a comment

/lgtm

openshift-ci bot added the lgtm label Oct 7, 2021.
@bwplotka (Contributor, Author) commented Oct 8, 2021

What's missing to get this merged? 🤗

@jeremyeder (Contributor) commented:
I scanned this whole thing again, and since the SLO abstraction (which I also highly support) is scoped out, I would lgtm this as-is.

@simonpasquier (Contributor) commented:
/approve

I think that everybody had a chance to share their feedback, and the current version reflects, as much as it can, where we want to and should go. Thank you all for the constructive comments, and thanks @bwplotka and @fpetkovski for this significant proposal :)

openshift-ci bot commented Oct 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Oct 19, 2021.
openshift-merge-robot merged commit ad49b8e into openshift:master on Oct 19, 2021.