Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: add initial support for SLI/SLO metrics #461

Merged
merged 1 commit into from
Oct 7, 2024

Conversation

beraldoleal
Copy link
Member

@beraldoleal beraldoleal commented Sep 26, 2024

Introduces high-level SLI metrics that can help derive SLOs.

Currently, the osc-monitor image only includes the kata-monitor binary. The idea is to reuse the same image to also include the metrics-server.

@openshift-ci openshift-ci bot requested review from gkurz and jensfr September 26, 2024 19:57
@beraldoleal beraldoleal force-pushed the sli-metrics branch 3 times, most recently from a4d06b7 to 666670d Compare September 26, 2024 21:53
Copy link
Contributor

@littlejawa littlejawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on the deployment side:
If I understand correctly, the new metrics binary will be shipped in the same image as kata-monitor, but the image will be run twice: once for kata-monitor (with the daemonset defined by the operator's code), and a second time with the deployment files you're adding here?

I'm not sure about the benefit of using the same image in separate containers that do different things.
If it is the same image, why not run the two binaries in parallel from it?
If we can't / don't want to have both binaries running from the same place, why do we make a single image ?

cmd/metrics/metrics.go Outdated Show resolved Hide resolved
config/metrics/metrics-service.yaml Show resolved Hide resolved
@beraldoleal
Copy link
Member Author

I'm not clear on the deployment side: If I understand correctly, the new metrics binary will be shipped in the same image as kata-monitor, but the image will be run twice: once for kata-monitor (with the daemonset defined by the operator's code), and a second time with the deployment files you're adding here?

Yes. Currently kata-monitor binary is part of the "osc-monitor" image. And the idea is to reuse this image for "monitoring stuff". With both kata-monitor and the metrics server.

I'm not sure about the benefit of using the same image in separate containers that do different things. If it is the same image, why not run the two binaries in parallel from it? If we can't / don't want to have both binaries running from the same place, why do we make a single image ?

Even though they're in the same image, running them in separate containers makes it easier to manage and scale each one independently. This way, kata-monitor can continue running as a DaemonSet on every node, while the metrics server can be deployed separately as a Deployment, running only where needed.

Running both in the same container is possible and I considered this option too... but it can get tricky with managing processes, restarts, and resource allocation. By using the same image for both but keeping them in their own containers, we get the best of both worlds: fewer images to manage (specially on the downstream configuration), but still clear separation between the two services. Wdyt?

@littlejawa
Copy link
Contributor

Running both in the same container is possible and I considered this option too... but it can get tricky with managing processes, restarts, and resource allocation. By using the same image for both but keeping them in their own containers, we get the best of both worlds: fewer images to manage (specially on the downstream configuration), but still clear separation between the two services. Wdyt?

I actually don't like the idea of running both processes from the same container :-) I only mentioned it to understand where the benefit is. So I'm fine running them separately.
But I tend to think that each container image should contain the things it needs to run, and nothing more - the simpler, the better. If we run them separately, then I'd be in favor of using separate images.

Now I understand this is to be balanced with the complexity of building/publishing/maintaining separate images, so I don't want to enforce that. Just feeling this is weird to deploy two applications in a single pod.

I can go with it if other reviewers are OK.

@beraldoleal
Copy link
Member Author

Now I understand this is to be balanced with the complexity of building/publishing/maintaining separate images, so I don't want to enforce that. Just feeling this is weird to deploy two applications in a single pod.

The separation won’t force us to use two applications in a single pod. Each will run in its own pod: kata-monitor as a DaemonSet and the metrics server as a Deployment.

Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running both in the same container is possible and I considered this option too... but it can get tricky with managing processes, restarts, and resource allocation. By using the same image for both but keeping them in their own containers, we get the best of both worlds: fewer images to manage (specially on the downstream configuration), but still clear separation between the two services. Wdyt?

I actually don't like the idea of running both processes from the same container :-) I only mentioned it to understand where the benefit is. So I'm fine running them separately. But I tend to think that each container image should contain the things it needs to run, and nothing more - the simpler, the better. If we run them separately, then I'd be in favor of using separate images.

Now I understand this is to be balanced with the complexity of building/publishing/maintaining separate images, so I don't want to enforce that. Just feeling this is weird to deploy two applications in a single pod.

I can go with it if other reviewers are OK.

I guess it is acceptable to ship some extra metric logic in the existing image if it eases the work.

cmd/metrics/metrics.go Outdated Show resolved Hide resolved
cmd/metrics/metrics.go Outdated Show resolved Hide resolved
cmd/metrics/metrics.go Outdated Show resolved Hide resolved
cmd/metrics/metrics.go Outdated Show resolved Hide resolved
cmd/metrics/metrics.go Show resolved Hide resolved
@gkurz
Copy link
Member

gkurz commented Sep 30, 2024

IIUC this is targeting peer pods exclusively. Maybe worth to mention it in the commit message.

@beraldoleal beraldoleal force-pushed the sli-metrics branch 2 times, most recently from 5db0bcc to 13ec837 Compare September 30, 2024 19:46
Copy link
Member

@gkurz gkurz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of cosmetic changes left but rest LGTM.

Thanks @beraldoleal !

cmd/metrics/metrics.go Outdated Show resolved Hide resolved
cmd/metrics/metrics.go Outdated Show resolved Hide resolved
cmd/metrics/metrics.go Outdated Show resolved Hide resolved
@beraldoleal beraldoleal force-pushed the sli-metrics branch 2 times, most recently from ce49b19 to 030ae01 Compare October 1, 2024 17:27
Introduces high-level SLI metrics that can help derive SLOs for
kata-remote.

Currently, the osc-monitor image only includes the kata-monitor binary.
The idea is to reuse the same image to also include the metrics-server.

For now, the idea is to collect metrics for kata-remote, but we can
expand later to get more metrics.

Signed-off-by: Beraldo Leal <bleal@redhat.com>
Copy link

openshift-ci bot commented Oct 1, 2024

@beraldoleal: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/sandboxed-containers-operator-e2e 4f8386a link false /test sandboxed-containers-operator-e2e
ci/prow/check 4f8386a link false /test check

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Copy link
Contributor

@littlejawa littlejawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks @beraldoleal

@beraldoleal
Copy link
Member Author

@gkurz can we merge this one?

@gkurz gkurz merged commit 8877ce1 into openshift:devel Oct 7, 2024
2 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants