
Added proposal for HyperShift monitoring. #981

Merged · 15 commits merged into openshift:master on Jun 13, 2022

Conversation

@bwplotka (Contributor) commented Dec 8, 2021

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>

@mmazur (Member) commented Dec 9, 2021

A general comment – I'm having trouble reading the diagrams because the terminology isn't consistent with https://hypershift-docs.netlify.app/reference/concepts/.
Granted, there isn't a term there for the "hosted cluster worker node part", and you use "dataplane". I'm not sure that's intuitive/optimal.


### User Stories

TBD
Contributor:

This has to be fleshed out before this enhancement is merged.

Contributor (author):

Do you mind helping us with that? @jeremyeder @csrwng

Contributor (author):

Looks like no help is coming, so I will try to fill in here what we know so far.

@alvaroaleman (Contributor) left a comment

Thanks a lot for taking the time to put this together :)

I have some feedback, but at a high level my biggest concern is that this conflates "requirements for HyperShift" with "requirements for an observability product suited for the SD organization". IMHO this enhancement should describe only one of the two plus the interface between them, but not both.

Now that I am at the end of the doc, my impression is that it mostly describes requirements for the observability service, not for HyperShift, since the number of changes for HyperShift seems to be tiny:

  • Set up management cluster monitoring to remote_write (it is debatable whether this is even something HyperShift should do; see the config sketch below).
  • Deploy some kind of agent into the cluster that scrapes service-provider-owned components and remote_writes the result to a central metrics service.

Is my understanding correct?
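
For concreteness, the first item could look roughly like the cluster-monitoring-config fragment below on the management cluster. This is only a sketch: the RHOBS receive URL and the metric allow-list are hypothetical placeholders, and a real setup would also need credentials wired in.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    prometheusK8s:
      remoteWrite:
        # Hypothetical RHOBS receive endpoint; authentication (OAuth2/mTLS) omitted here.
        - url: https://rhobs.example.com/api/metrics/v1/hypershift/api/v1/receive
          writeRelabelConfigs:
            # Forward only a small allow-list of series needed by the service provider
            # (placeholder metric names).
            - sourceLabels: [__name__]
              regex: "up|cluster_version|kube_node_status_condition"
              action: keep
```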


## Alternatives

#### Pay for Observability Vendor
Contributor:

This is not an alternative to the proposed changes in HyperShift; it is an alternative to the in-house observability product.

Contributor (author):

I think it's a viable alternative; AFAIK we already use one for logging, no?

I feel it's integral to the solution we propose.


#### What About Layered Service / Addons Monitoring?

This was discussed and solved by [Monitoring Stacks](monitoring-stack-operator.md). In the HyperShift context, Addons requirements are no different from what we assume customers would want to do/use. So in the context of this enhancement, Addons are just another CUS.
Member:

I can see why this would be a non-goal above. It's easier to say they are just "Customer workloads".
However, I don't think this is the right attitude if we want addons to succeed as a whole.
I'm thinking internal (RH) here, where there's this RHOBS thing that is solving a lot of problems.
Maybe this isn't the right forum, but why wouldn't addons use that service?

@bwplotka (author) commented Dec 15, 2021

Thanks, I see your point, but I don't think that is the message here. I never said addons are not part of RHOBS. You are already part of it and using it, so the direction is definitely to have you in.

We are only trying to descope this proposal so it doesn't cover too many topics. Whether we like it or not, our requirements for addon observability are really no different from customer ones (from what we have gathered so far). Addons, as part of the SD org, are first-class citizens in RHOBS, but AFAIK you can deploy the same way to HyperShift plus this enhancement AND to normal OSD, and that is the point here, no? (:

Sorry if it came across as addons being less important. They are not; they are equally important. Is there a way I could improve the language here?

Member:

That makes sense.
Maybe all that's needed here is to say that RHOBS would still be the intended metrics & alerting solution for RH-managed services on HyperShift, as per https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/monitoring-stack-operator.md?

@simonpasquier (Contributor) left a comment

Epic work! Similar to what @alvaroaleman already mentioned, I'd like to see more details about what needs to change in the cluster monitoring operator (CMO) to support the HyperShift deployment model. Describing how CMO currently works in HyperShift would be valuable too (I assume that @csrwng has the best knowledge here).
It would be good to detail which components depend on the monitoring interfaces today (console, metrics API, HPA, ...) and how they will be impacted (for instance, do HyperShift customers have access to the OCP console? If yes, from where would the console retrieve the metrics/alerts?).

@bwplotka (author):

Hi All!

Please take a look at the latest update to this enhancement. Changes:

  • Changed the data-plane details given feedback from the in-cluster team. Right now we propose deploying an unchanged CMO that will also be used to remote write a tiny amount of metrics to RHOBS for HSRE use.
  • Removed details around RHOBS. Given the feedback we got, the proposal now focuses on changes to OpenShift in the HyperShift topology, without mentioning RHOBS platform details.

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>


### Open Questions

* Who will configure CMO to allow remote writing metrics to RHOBS?
Member:

In both the management cluster and guest cluster case I expect CMO to be configured by SD. Specifically by App SRE as they manage the RHOBS instance and know where metrics should be routed, but in collaboration with SRE Platform to work through nuances such as guest clusters with no egress.

Contributor:

  • mgmt CMO is standard OSD/ROSA by SREP?

  • we should consider solving for guest clusters with no egress because that solves everything?

Let's take a look at all parts:

1. Hosted control planes do not deploy any monitoring of their own, other than the relevant ServiceMonitors to be scraped.
2. For all control planes we propose to deploy the [Monitoring Stack Operator](https://github.com/rhobs/monitoring-stack-operator), which will deploy a set of Prometheus/Prometheus Agent instances that send data off-cluster to RHOBS. No local alerting or local querying will be allowed. This path will forward both monitoring data and the part of telemetry relevant for Telemeter.
Contributor:

What is the requirement for Prometheus here? How does the agent not suffice?

Contributor:

I would assume that it's a matter of time to market. The Prometheus Operator doesn't support Prometheus in agent mode yet, though we have work in progress (prometheus-operator/prometheus-operator#3989).
In the meantime, I suppose that the monitoring stack operator could deploy a Prometheus server with no rules that does not expose the API endpoints.

Contributor (author):

👍🏽
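
A rough sketch of what such a "forward-only" Prometheus could look like as a Prometheus Operator resource. The name, namespace, selector label, and remote write URL are hypothetical, and exact defaults for unset rule/alerting selectors depend on the operator version.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: control-plane-forwarder        # hypothetical name
  namespace: hypershift-monitoring     # hypothetical namespace on the management cluster
spec:
  replicas: 1
  # Scrape ServiceMonitors of the hosted control planes (placeholder label).
  serviceMonitorSelector:
    matchLabels:
      hypershift.openshift.io/monitoring: "true"
  serviceMonitorNamespaceSelector: {}  # all namespaces; a real setup would narrow this
  # No `ruleSelector` and no `alerting` section: no local rules or alert routing.
  remoteWrite:
    - url: https://rhobs.example.com/api/v1/receive  # placeholder RHOBS endpoint
  retention: 2h                        # short local retention; data is only buffered before forwarding
```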

Contributor:

To be sure I understand, there will be one "service" per management cluster scraping and forwarding metrics for all hosted control planes?

1. Ensuring control plane service monitors: HOSTEDCP-169
2. Ensuring a forwarding mechanism (e.g. using a monitoring stack and Prometheus/Prometheus Agent) on the management cluster.
3. Configuring it automatically for each hosted control plane, so it remote writes to RHOBS using the correct credentials and the correct whitelist for both monitoring and telemetry.
4. Allowing deployment of CMO in the data plane: MON-2143
Contributor:

This is already the case, as @csrwng commented above.
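
For item 1 in the quoted list, a control plane ServiceMonitor could look roughly like the sketch below. The component, namespace, labels, and port name are placeholders for illustration, not what HOSTEDCP-169 actually ships.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kube-apiserver                    # hypothetical control plane component
  namespace: clusters-example-hosted      # hypothetical hosted control plane namespace
spec:
  selector:
    matchLabels:
      app: kube-apiserver                 # assumes the Service carries this label
  endpoints:
    - port: metrics                       # assumes a named metrics port on the Service
      interval: 30s
      scheme: https
      tlsConfig:
        insecureSkipVerify: true          # placeholder only; a real setup verifies certificates
```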

@jewzaam (Member) left a comment

A few points discussed on the sync call this morning that need to be incorporated if they aren't already:

  1. We need a consistent key on all metrics from a cluster to identify the cluster.
  2. This includes all SRE and managed service metrics.
  3. Running telemeter-client to scrape and push worker, SRE, and managed service metrics might be a good solution.

There would be changes needed, I think. We need a way to configure which metrics get shipped beyond the standard ones, and a way to set the frequency for scraping and remote writing. I don't know that these need to be different; it probably doesn't make sense for them to be, actually.

@bwplotka fyi @dofinn

@dofinn (Contributor) commented Feb 4, 2022


Regardless of whether it's telemeter or Prometheus remote_write that ships the data, SRE will need an _id field on all exported metrics. This is a problem we have not solved in OSD/ROSA (https://issues.redhat.com/browse/OSD-6573), as we don't have the ability to configure external_labels dynamically per cluster for remote write into obs-mst.

@jan--f (Contributor) commented Feb 4, 2022

CMO adds the _id label to the telemeter-client config today. We can certainly do the same for any remote_write config as well.
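
As an illustration only (not how CMO actually wires the label in): the cluster-monitoring-config already accepts external labels, which Prometheus attaches to every remote-written series. The cluster UUID and endpoint below are placeholders.

```yaml
data:
  config.yaml: |
    prometheusK8s:
      externalLabels:
        _id: 01234567-89ab-cdef-0123-456789abcdef    # placeholder cluster UUID
      remoteWrite:
        - url: https://rhobs.example.com/api/v1/receive  # placeholder endpoint
```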

@simonpasquier (Contributor):

> We need a way to set the frequency for scraping and remote writing. I don't know that these need to be different. Probably doesn't make sense to be different actually.

Remote write just pushes samples as they get ingested by Prometheus, though you can control the size of the shards, the time to wait before sending a batch, the maximum number of samples per batch, etc. (see queue_config in https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write).
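
For reference, a minimal queue_config sketch in plain Prometheus configuration; the URL and the numbers are arbitrary examples, not recommendations.

```yaml
remote_write:
  - url: https://rhobs.example.com/api/v1/receive  # placeholder endpoint
    queue_config:
      capacity: 10000             # samples buffered per shard
      max_shards: 50              # upper bound on parallel sending shards
      max_samples_per_send: 2000  # maximum batch size
      batch_send_deadline: 5s     # max wait before sending a partial batch
```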

@csrwng (Contributor) commented May 11, 2022

/remove-lifecycle rotten

@openshift-ci bot removed the lifecycle/rotten label on May 11, 2022
@bwplotka (author) commented Jun 8, 2022

Thanks, resolving the CI

@openshift-ci bot removed the lgtm label on Jun 8, 2022
Changes:
* Changed data-plane details given feedback from the in-cluster team
* Changed how HyperShift admins can forward metrics to RHOBS on the data plane
* Removed details around RHOBS

Signed-off-by: Bartlomiej Plotka <bwplotka@gmail.com>
@simonpasquier (Contributor):

/approve

During the last HyperShift meeting, involved parties agreed to move forward and merge the enhancement proposal.

@openshift-ci bot commented Jun 9, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier


@openshift-ci bot added the approved label on Jun 9, 2022
@csrwng (Contributor) commented Jun 13, 2022

/label tide/merge-method-squash

@openshift-ci bot added the tide/merge-method-squash label on Jun 13, 2022
@csrwng (Contributor) commented Jun 13, 2022

/lgtm

@openshift-ci bot added the lgtm label on Jun 13, 2022
@openshift-ci bot commented Jun 13, 2022

@bwplotka: all tests passed!



@openshift-merge-robot merged commit 58894a6 into openshift:master on Jun 13, 2022
Labels: approved, lgtm, tide/merge-method-squash