Proposed OLM-based Monitoring Stack Solution for RH Managed Services and future needs. #866

Merged

Conversation

@bwplotka (Contributor) commented Aug 12, 2021

As a result of the collaboration between the Monitoring Team and the Service Delivery org in our Monitoring Enhancement Working Group, @fpetkovski and I, on behalf of our team, would like to propose a solution for the immediate needs of Managed Services (SaaS/Layered Services/Addons).

Feedback is very welcome!

PTAL @alanmoran @smarterclayton @kbsingh @pb82 @simonpasquier @brampling

Please keep all offline discussions in this PR so everyone is up-to-date.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

…and future needs.

As a result of the collaboration between the Monitoring Team and the Service Delivery org in our Monitoring Enhancement Working Group, @fpetkovski and I would like to propose a solution for the immediate needs of Managed Services (SaaS/Layered Services/Addons).

Feedback is very welcome! Please keep all offline discussions in this PR so everyone is up-to-date.

Signed-by: Bartlomiej Plotka <bwplotka@gmail.com>
fpetkovski and others added 4 commits August 12, 2021 15:49
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka (Contributor, Author) commented Aug 14, 2021

Got a bit upset by the linter, so I filed issue #869 (: (with some ideas on how to improve it)

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@bwplotka (Contributor, Author) commented:
/test markdownlint

@simonpasquier (Contributor) left a comment

This is very good! Thanks for the effort.

(7 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)

For simplicity reasons, routing alerts that come from managed service monitoring stacks will be handled by a single Alertmanager.
We can take advantage of the recently added <code>[AlertmanagerConfig](https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md#alertmanagerconfig)</code> and allow MTO/MTSRE to create their own routing configuration in a centrally deployed Alertmanager instance.
Finally, the MTSRE team could centrally configure receivers in one place for all managed services and service owners would simply use those receivers in their routing rules. It is worth noting that the last feature depends on [https://github.com/prometheus-operator/prometheus-operator/issues/4033](https://github.com/prometheus-operator/prometheus-operator/issues/4033).
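For illustration only (this sketch is not part of the quoted enhancement text), a per-tenant routing configuration using the Prometheus Operator's AlertmanagerConfig CRD might look roughly like this; the namespace, matcher, receiver, and Secret names are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: example-addon-routing          # hypothetical name
  namespace: example-addon-namespace   # hypothetical tenant namespace
spec:
  route:
    receiver: addon-pagerduty
    groupBy: ["alertname", "namespace"]
    matchers:
      - name: service
        value: example-addon
  receivers:
    - name: addon-pagerduty
      pagerdutyConfigs:
        - serviceKey:
            name: pagerduty-credentials  # hypothetical Secret in the same namespace
            key: serviceKey
```

Note that the Prometheus Operator scopes an AlertmanagerConfig to the namespace it is created in, which is what the scoping question further down this thread refers to.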
Contributor:

I don't see this issue as a blocker since it's already possible to configure the global Alertmanager configuration with a secret. Having said that, supporting the same structures for global and per-namespace Alertmanager configurations would be nice indeed.

Contributor (Author):

👍🏽

Contributor:

@simonpasquier is it possible to use both the secret and an AlertmanagerConfig? I thought that with the AlertmanagerConfig everything gets scoped to the namespace from which the config is coming. If that is the case, how would a tenant reference a receiver from the common secret in their AlertmanagerConfig?

(resolved review thread on enhancements/monitoring/monitoring-stack-operator.md — outdated)
@aditya-konarde (Contributor) left a comment

Thanks a lot for working on this! ❤️

Added some suggestions and clarifying questions :)

(10 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)
@matej-g (Contributor) left a comment

Solid chunk of work 💪

Mostly minor suggestions from my side

(5 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)
@jan--f (Contributor) left a comment

Nice, this sounds really good! Left a few comments/suggestions inline.

I'm not sure whether the following is out of scope for this document, but should we consider maintenance of this somewhere, maybe in the risks section?
I assume the Monitoring team will maintain this. Adding an additional operator to the team's maintenance workload seems worth mentioning.

(5 resolved review threads on enhancements/monitoring/monitoring-stack-operator.md — outdated)
openshift-ci bot removed the lgtm label Sep 16, 2021.
@simonpasquier (Contributor) commented:
/lgtm

openshift-ci bot added the lgtm label Sep 24, 2021.
@jeremyeder (Contributor) commented:
@cooktheryan our robot overlords have you set as the next reviewer. Is that correct?

@cooktheryan commented:
@jeremyeder I wish that were the case, but I don't believe I am the correct individual to be making that decision. This should be tagged for someone else.

@sugarraysam left a comment

Hello, I am part of the MTSRE team and would like to thank you for putting this proposal together. This is awesome, and I am looking forward to working together to make this new project happen.

I posted a few comments/suggestions, and they all have the same two underlying goals:

  1. Expose an SLO based interface to our tenants
  2. Make sure we perform active reconciliation on this interface (operator-pattern)

This is very important, as the end goal here is to be able to measure SLOs/SLIs and error budgets. We should not ask our MTO to tweak or touch any Prometheus/Alertmanager/Grafana configs. They need to spend their time doing feature development and bug fixing.

There exist various tools out there that could help us abstract away Prometheus/Alertmanager and Grafana configs:


#### Story 1

As a MTSRE/MTO, I would like to define ServiceMonitors that will scrape metrics from my managed service by a Prometheus-compatible system.
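As a hedged illustration of this story (not taken from the enhancement itself), a minimal ServiceMonitor for a managed service could look roughly as follows; all names, labels, and the port are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-addon-metrics          # hypothetical
  namespace: example-addon-namespace   # hypothetical
  labels:
    app.kubernetes.io/part-of: example-addon
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-addon
  endpoints:
    - port: metrics    # name of the Service port exposing /metrics
      interval: 30s
      path: /metrics
```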
@sugarraysam commented Oct 5, 2021

I want to challenge the interface we want to expose. Instead of ServiceMonitor, it would be much better to provide an SLO-based interface. The reasoning is that we want to abstract away as much of the monitoring stack as we can. We want MTO to focus on SLOs and SLIs, not on becoming Prometheus experts. The story I suggest is this one:

As an MTSRE/MTO, I would like to define an SLO-based CR that generates Prometheus-compatible configuration to scrape metrics from my managed service with a Prometheus-compatible system.

Example implementation:
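(The comment's original example is not reproduced here. As a hedged sketch only, an SLO-based CR of the kind proposed, modeled on sloth's PrometheusServiceLevel API referenced later in this thread, could look roughly like this; the service name, queries, and labels are hypothetical, and field names should be checked against sloth's documentation:)

```yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: example-addon-slos           # hypothetical
  namespace: example-addon-namespace # hypothetical
spec:
  service: example-addon
  slos:
    - name: requests-availability
      objective: 99.5                # target: 99.5% of requests succeed
      description: Example availability SLO for the addon's HTTP API.
      sli:
        events:
          # Error and total event queries; {{.window}} is templated by the controller.
          errorQuery: sum(rate(http_requests_total{job="example-addon",code=~"5.."}[{{.window}}]))
          totalQuery: sum(rate(http_requests_total{job="example-addon"}[{{.window}}]))
      alerting:
        name: ExampleAddonHighErrorRate
        pageAlert:
          labels:
            severity: critical
        ticketAlert:
          labels:
            severity: warning
```

From such a CR, the controller would generate the underlying recording and alerting rules, so tenants never edit Prometheus configuration directly.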


#### Story 3

As an MTSRE/MTO, I would like to create alerting and recording rules for Prometheus metrics exposed by my managed service.
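As a hedged illustration of this story (again, not from the enhancement itself), the existing Prometheus Operator API for it is the PrometheusRule CR; the metric names, expressions, and threshold below are placeholders:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-addon-rules          # hypothetical
  namespace: example-addon-namespace # hypothetical
spec:
  groups:
    - name: example-addon.rules
      rules:
        # Recording rule: pre-compute the request error ratio over 5 minutes.
        - record: example_addon:http_requests:error_ratio5m
          expr: |
            sum(rate(http_requests_total{job="example-addon",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="example-addon"}[5m]))
        # Alerting rule: fire when the error ratio stays above 5% for 10 minutes.
        - alert: ExampleAddonHighErrorRate
          expr: example_addon:http_requests:error_ratio5m > 0.05
          for: 10m
          labels:
            severity: warning
```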
@sugarraysam commented Oct 5, 2021

Again, proposing to abstract alerting and recording rules away in favor of a PrometheusServiceLevel CR or something similar (sloth):

As an MTSRE/MTO, I would like to create SLO-oriented CRs that will translate into Prometheus alerting and recording rules for my managed service.

Example implementation:


#### Story 4

As an MTSRE, I would like to configure routing for alerts defined on Prometheus metrics exposed by my SaaS/Addon.
@sugarraysam commented Oct 5, 2021

Again, proposing to abstract routing away in favor of a PrometheusServiceLevel CR or something similar (sloth):

As an MTSRE/MTO, I would like to create SLO-oriented CRs that will configure alerts defined on Prometheus metrics exposed by my SaaS/Addon.

Example implementation:


#### Story 6

As an MTO, I would like to create dashboards for certain metrics across all clusters where the SaaS/Addon is deployed or used.
@sugarraysam commented Oct 5, 2021

I think SLO dashboards can be abstracted as well. Creating Grafana dashboards is extremely time-consuming, and as an MTSRE engineer, I want to have a homogeneous environment when switching from one addon to another. Plus, we do not want MTO to become Grafana dashboard experts or spend any time on this. This is why I would highly suggest:

As an MTO, I would like to provide an SLO-based CR that automatically creates dashboards allowing me to visualize SLOs/SLIs and error budgets for certain metrics across all clusters where the SaaS/Addon is deployed or used.

Example implementation:
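(The original example is likewise not reproduced. As a rough, hedged sketch only, assuming the community grafana-operator and its GrafanaDashboard CRD were used for declarative dashboard delivery, a generated SLO dashboard could be shipped along these lines; the names and the dashboard JSON are placeholders:)

```yaml
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: example-addon-slo-dashboard  # hypothetical
  namespace: example-addon-namespace # hypothetical
  labels:
    app: grafana
spec:
  json: |
    {
      "title": "example-addon SLOs",
      "panels": []
    }
```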

#### Story 7

As a Customer, I can use UWM without seeing Addons metrics, alerting or recording configurations.


Adding an extra requirement for MTSRE/MTO:

#### Story 8

As an MTO, I would like to provide an SLO-based CR that is actively reconciled by a Kubernetes-based operator, so that I can update my SLO/SLI configurations smoothly.

Example implementation:

@fpetkovski (Contributor) commented:
Hi @sugarraysam, thanks for the suggestions. Having a higher-level abstraction on top of ServiceMonitors and Dashboards is an awesome idea. Since it comes a bit late in the proposal process and we've already started work on this project, could you create an issue with your suggestions instead? This is the repository for the project: https://github.com/rhobs/monitoring-stack-operator

@nb-ohad commented Oct 6, 2021

@fpetkovski @sugarraysam
I think a higher-level abstraction is a nice idea on paper, but the reality is that the SRE team, or even the tenants' eng teams, are not always the ones who define the monitoring resources.

In the cases I have seen so far, including our own case (ODF MS), the underlying product (that is backing the service) is the one responsible for deploying/managing/reconciling the monitoring resources (ServiceMonitors, PodMonitors, and PrometheusRules).

If we go with the proposed abstraction as the only way to configure scraping, we will create a situation where these kinds of tenants have to "jump through many hoops" and introduce some ugly workarounds just to configure scraping.

If we look at the big picture, taking into account all the deployed components (from the infrastructure to the products themselves), we will see that all these workarounds just create a less stable solution.

@fpetkovski (Contributor) commented:
I think we don't have to make a mutually exclusive decision. We can still support all the existing APIs and add a few new ones on top.

@sugarraysam commented Oct 6, 2021

@nb-ohad @fpetkovski I like the idea of exposing multiple interfaces; it makes things flexible and allows for a transition period. I understand it might take some work to migrate the existing monitoring to an SLO-based approach, but in the long run this is what we should aim for. Monitoring without SLOs does not make much sense. Why are you alerting on X or Y? How does it relate to your users? CPU usage and memory consumption without context are not good measures. The whole industry is shifting towards SLO-based monitoring.

Will follow up with a feature request (issue) -> https://github.com/rhobs/monitoring-stack-operator

(edit: this is what app-sre is doing in app-interface as well /schemas/app-sre/slo-document-1.yml)

@nb-ohad commented Oct 6, 2021

@sugarraysam I get your point, but your proposal narrows the solution down to a single use case, which is facilitating service SLO tracking and visibility. There are other use cases for monitoring a managed service, for example providing product-defined dashboards / OCP console plugins to the customer (as ODF is doing). The metrics needed to accomplish that are unrelated to the SLOs, the SLIs, or even the serviceability of the service.

That said, I don't see the proposed interface as a replacement for the "older" interface, but perhaps as an enhancement. This means we should not set any expectation that a transition will happen for some offerings.

openshift-ci bot removed the lgtm label Oct 7, 2021.
@bwplotka (Contributor, Author) commented Oct 7, 2021

I updated the enhancement with the following changes:

  • Added Graduation criteria.
  • Added a story for exposing metrics to the Console and how to do it.
  • Added a mention of further steps around soft-tenancy potential in later iterations.
  • Fixed typos.

As for your comments, @sugarraysam, thanks for your idea. It is being discussed in rhobs/observability-operator#13, and I think it's a solid thing to implement on top of ServiceMonitors. I don't think we can remove everything else that is not SLO-based. Monitoring is more than SLOs; it's also about troubleshooting, sometimes billing, data analysis of features, etc. So I would not trim everything down to SLO usage only, even if, as MTSRE, you have the right to ignore anything else. Does it make sense? (:

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

#### Tech Preview -> GA

TBD
* More options exposed in Monitoring Stack resource
* We have onboard at lease few Addons on this feature.
Contributor:

"at lease" -> "at least"

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@simonpasquier (Contributor) left a comment

/lgtm

openshift-ci bot added the lgtm label Oct 7, 2021.
@bwplotka (Contributor, Author) commented Oct 8, 2021

What's missing to get this merged? 🤗

@jeremyeder (Contributor) commented:
I scanned this whole thing again, and since the SLO abstraction (which I also highly support) is scoped out, I would lgtm this as-is.

@simonpasquier (Contributor) commented:
/approve

I think that everybody had a chance to share their feedback, and the current version reflects, as much as it can, where we want to and should go. Thank you all for the constructive comments, and thanks @bwplotka and @fpetkovski for this significant proposal :)

openshift-ci bot commented Oct 19, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label Oct 19, 2021.
openshift-merge-robot merged commit ad49b8e into openshift:master on Oct 19, 2021.