
[receiver/kubeletstats] Add metric to represent the percentage of cpu and memory utilization compared to the configured limit #24905

Closed
TylerHelmuth opened this issue Aug 4, 2023 · 19 comments · Fixed by #27276

@TylerHelmuth (Member) commented Aug 4, 2023

Component(s)

No response

Is your feature request related to a problem? Please describe.

The receiver reports the current memory and CPU usage for pods and containers, but unless you know the limits set for the containers in the pod, it is hard to determine whether that usage is getting too high.

Describe the solution you'd like

A new set of pod and container metrics representing memory and CPU utilization, calculated from the current usage value and the configured limits (a rough sketch of the calculation follows the list):

  • k8s.pod.memory.usagePercent
  • k8s.pod.cpu.usagePercent
  • k8s.container.memory.usagePercent
  • k8s.container.cpu.usagePercent
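
For illustration only, a minimal Go sketch of the calculation these metrics imply. The function and names are hypothetical, not the receiver's actual code; it assumes usage comes from the kubelet stats endpoint and the limit from the pod spec:

```go
package main

import "fmt"

// usagePercent returns usage as a percentage of the configured limit.
// The boolean is false when no limit is configured, in which case the
// metric would simply not be emitted.
func usagePercent(usage, limit float64) (float64, bool) {
	if limit <= 0 {
		return 0, false // no limit configured for this container/pod
	}
	return usage / limit * 100, true
}

func main() {
	// Example: a container using 350 MiB against a 512 MiB memory limit.
	if pct, ok := usagePercent(350, 512); ok {
		fmt.Printf("k8s.container.memory.usagePercent = %.1f\n", pct) // ~68.4
	}
}
```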
@TylerHelmuth added the enhancement, needs triage, priority:p2, and receiver/kubeletstats labels and removed the needs triage label on Aug 4, 2023
@github-actions bot (Contributor) commented Aug 4, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions bot (Contributor) commented Aug 4, 2023

Pinging code owners for receiver/kubeletstats: @dmitryax. See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jmichalek132 (Contributor) commented

I mean, there is already another metric for the limits; isn't comparing the two in your storage/query pipeline enough?

@TylerHelmuth (Member, Author) commented

@jmichalek132 I am not aware of a metric collected by kubeletstats for individual container resource limits, or one representing the sum of the container limits for the whole pod.

But I don't feel we should limit what types of metrics the Collector can produce. As a storage/query-agnostic solution, the Collector should be able to produce any metric that the user needs. We already perform other computations while collecting metrics. I don't think we should restrict ourselves from producing meaningful metrics based on other metrics (in this case, the percentage of UsageNanoCores usage compared to a limit).

@jinja2 (Contributor) commented Aug 18, 2023

If we add utilization compared to limits, we should also introduce utilization compared to requests, or at least make the metric reflect whether it is relative to the limit or the request.

@dmitryax (Member) commented

> If we add utilization compared to limits, we should also introduce utilization compared to requests, or at least make the metric reflect whether it is relative to the limit or the request.

This is a good point. I'm curious what metrics other k8s monitoring agents provide regarding resource utilization.

@TylerHelmuth (Member, Author) commented

I will do some research. A metric with respect to resources.requests is doable, but in my opinion it would be less valuable than one based on resources.limits, since requests are a scheduling concern, not a runtime concern.

@TylerHelmuth (Member, Author) commented

Seeing when your usage is above your requests could be interesting, as this would indicate that you've under-provisioned your containers/pod and that Kubernetes might evict the pod.

Going back to the OTel naming specification, utilization is currently reserved to mean the fraction of usage out of its limit, where limit is defined as the "constant, known total amount of something". Since Kubernetes will kill pods that exceed their defined resources.limits, I think .utilization should be reserved for representing usage compared to resources.limits.

I think a measurement comparing usage to resources.requests is a good idea, but I don't think it should be named utilization. I also think such a metric should be allowed to go above 1 (100%), since tracking how often the metric is above 1 is what is interesting.

As for research, I can share that we (Honeycomb) have an agent that emits a utilization metric based on limits. I also found some DataDog documentation recommending utilization compared to limits and compared to requests. Kube-state-metrics reports the values individually and expects you to compare them yourself, but I am still staunchly in favor of calculating the result within the Collector.

@jinja2 (Contributor) commented Aug 22, 2023

I should've given more context on why I think usage w.r.t. requests is also useful, in addition to limits. Requests (the minimum amount of guaranteed resources) are relevant even after pods have been scheduled, and being able to track usage w.r.t. requests helps users ensure that optimal resource requests are set so their applications keep running with reduced disruption. One runtime use: when a node is under resource pressure, one of the factors taken into consideration when deciding which pods to evict is usage w.r.t. requests, so setting appropriate requests is important to reduce the chance of being evicted.

Another use case for the metric is making vertical scaling decisions. Users who don't use the Guaranteed QoS strategy (request = limit) usually take into account whether an application has been using more or less than its requested resources to right-size their workloads. It can also be useful for detecting changes in app behavior/benchmarks; e.g., a container consistently using more than its requested memory after a new version rollout could indicate that recent changes introduced some additional unaccounted-for memory overhead, and that the minimum memory requirement for the application to run has changed.

The CPU request decides the weighted distribution of CPU time among containers, so tracking usage w.r.t. requests in order to set appropriate request values is important to avoid CPU starvation. There are also resource-allocation guidelines that recommend setting only appropriate CPU requests and no limits (because limits can prevent processes from utilizing idle CPU cycles as they become available on the node).

@jinja2 (Contributor) commented Aug 22, 2023

Re: the proposed metrics k8s.pod.memory.usagePercent and k8s.pod.cpu.usagePercent, since limits/requests are set on containers, I assume these are calculated as the sum of the container limits. It would make sense not to calculate the value for the pod if any of its containers does not have the limit set. Is this the proposed plan?

> Going back to the OTel naming specification, utilization is currently reserved to mean the fraction of usage out of its limit, where limit is defined as the "constant, known total amount of something". Since Kubernetes will kill pods that exceed their defined resources.limits, I think .utilization should be reserved for representing usage compared to resources.limits.

I think a request can be seen as a limit as defined by the convention, since it is a known quantity we can meaningfully track usage against. But I can't think of a good way to indicate whether the utilization is w.r.t. the request or the limit without making it part of the metric name, e.g. container.cpu.request.utilization, which feels verbose and weird.

@TylerHelmuth (Member, Author) commented

> It would make sense not to calculate the value for the pod if any of its containers does not have the limit set. Is this the proposed plan?

Yes, this is how I have implemented the metrics for pods.
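
For illustration, a minimal sketch of that aggregation rule (hypothetical helper, not the actual receiver code):

```go
package main

import "fmt"

// podLimit sums the container limits for one resource (CPU or memory).
// If any container has no limit set (zero/undefined), the pod limit is
// treated as undefined and the pod-level utilization metric is skipped.
func podLimit(containerLimits []float64) (float64, bool) {
	var sum float64
	for _, l := range containerLimits {
		if l <= 0 {
			return 0, false // at least one container is unbounded
		}
		sum += l
	}
	return sum, true
}

func main() {
	if limit, ok := podLimit([]float64{0.5, 1.0}); ok {
		fmt.Println("pod CPU limit (cores):", limit) // 1.5
	}
	if _, ok := podLimit([]float64{0.5, 0}); !ok {
		fmt.Println("pod limit undefined; pod utilization not emitted")
	}
}
```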

> I think a request can be seen as a limit as defined by the convention, since it is a known quantity we can meaningfully track usage against.

I agree that it is a known value we can track against, but I wouldn't consider it a limit in the sense that OTel has defined the term, since Kubernetes allows memory or CPU usage to go above the configured requests. OTel's current naming conventions expect utilization to be bounded within [0, 1], which makes sense when the limit is an actual constraint. But we definitely want to allow a metric related to requests to go above 1.

Although it wasn't my original goal with this issue, I think that a metric relating usage to requests is a good idea. Reading through the specification, the definitions use SHOULD instead of MUST, which gives us an opening to treat requests as a limit, use the name utilization, and go above 1 (100%). I see no other defined name in the specification that fits better.

I will bring this topic up in the Collector SIG meeting tomorrow. @jinja2 are you able to attend? If we use utilization for both forms of the metric, I think the suggestion to add the type to the name is the best solution; the value could go before or after utilization. If we choose to use the term utilization with respect to requests, I think we should move forward with the following metrics:

  • k8s.pod.cpu.limits.utilization
  • k8s.pod.cpu.requests.utilization
  • k8s.pod.memory.limits.utilization
  • k8s.pod.memory.requests.utilization
  • container.cpu.limits.utilization
  • container.cpu.requests.utilization
  • container.memory.limits.utilization
  • container.memory.requests.utilization

If utilization cannot be used with respect to requests due to requests not being a hard limit, then we will have to come up with another name for those metrics, and the metrics measuring utilization with respect to the limits can be called *.utilization.

@jinja2 (Contributor) commented Aug 22, 2023

@TylerHelmuth Sgtm, and I can attend the SIG meeting tomorrow.

@TylerHelmuth (Member, Author) commented

Update from the SIG meeting today. The community liked these metrics in the form discussed above. The next step is to submit a PR to the semantic conventions introducing the concept of a soft limit and removing the restriction that utilization must be between 0 and 1.

Separately, we need to join the semantic-conventions discussion around plurality.

@jmichalek132 (Contributor) commented

@TylerHelmuth the k8sclusterreceiver already produces metrics for CPU and memory limits:
https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/receiver/k8sclusterreceiver/metadata.yaml#L187

@TylerHelmuth (Member, Author) commented

@jmichalek132 those metrics are the actual configured values. They are limit metrics. I'm proposing we produce some new utilization metrics based on those limits.

@dmitryax (Member) commented

@jmichalek132, thanks for bringing this up. That made me think more about the naming for the new utilization metrics.

Since plural requests and limits apply to all the resources in the k8s spec, it may be more natural to keep singular limit and request when applied to one resource. I think we should keep it consistent with the existing k8scluster metrics and introduce:

k8s.container.cpu_limit_utilization
k8s.container.memory_limit_utilization

Note that we cannot have k8s.container.cpu_limit.utilization because k8s.container.cpu_limit should not be introduced as a namespace if there is a metric with the same name already. But this rule might change as a result of open-telemetry/semantic-conventions#50

@TylerHelmuth (Member, Author) commented

I am also OK with those names. If we adopted that convention, the list of new metrics would be as follows (a sketch of how these could be computed follows the list):

  • k8s.pod.cpu_limit_utilization
  • k8s.pod.cpu_request_utilization
  • k8s.pod.memory_limit_utilization
  • k8s.pod.memory_request_utilization
  • k8s.container.cpu_limit_utilization
  • k8s.container.cpu_request_utilization
  • k8s.container.memory_limit_utilization
  • k8s.container.memory_request_utilization
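
For illustration, a hypothetical sketch of how the per-container values could be derived under this naming (not the receiver's actual implementation); the request-based ratio is intentionally left unclamped, since usage above the request is allowed and is exactly what we want to observe:

```go
package main

import "fmt"

// utilizations returns usage relative to the configured limit and request.
// Each boolean is false when the corresponding value is not set, in which
// case that metric would not be emitted. The request-based ratio may
// exceed 1 because Kubernetes allows usage above the request.
func utilizations(usage, limit, request float64) (limitUtil float64, limitOK bool, reqUtil float64, reqOK bool) {
	if limit > 0 {
		limitUtil, limitOK = usage/limit, true
	}
	if request > 0 {
		reqUtil, reqOK = usage/request, true
	}
	return
}

func main() {
	// Example: a container using 0.75 cores with a 1-core limit and a 0.5-core request.
	lu, lok, ru, rok := utilizations(0.75, 1.0, 0.5)
	if lok {
		fmt.Printf("k8s.container.cpu_limit_utilization = %.2f\n", lu) // 0.75
	}
	if rok {
		fmt.Printf("k8s.container.cpu_request_utilization = %.2f\n", ru) // 1.50, above 1 is expected
	}
}
```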

@jmichalek132 (Contributor) commented

The main reason I brought it up is that, since there is already a way to collect the usage and the limits, it's quite easy to calculate the desired data in the backend storing the metrics. So is it worth pre-calculating this? (Since these metrics will be produced for all pods/containers, there will be quite a few of them.)

@TylerHelmuth (Member, Author) commented

@jmichalek132 you're right that backends could calculate these metrics (assuming they can do some arithmetic, which most can). But in order for the backend to be the source of these metrics you would need the k8sclusterreceiver, which is a separate receiver and has specific deployment requirements. I don't like the idea of setting a precedent that a metric that can be calculated in receiver A should not be included, even as an optional metric, just because a combination of metrics from receiver A and some other receiver B could produce it.

Users should be able to look to the kubeletstats receiver for their pod and container metrics needs. It is reasonable for users to start off with only the kubeletstats receiver and not the k8scluster receiver, especially since introducing the k8sclusterreceiver means introducing a second deployment.

TylerHelmuth added a commit that referenced this issue Sep 26, 2023
**Description:**
Add metadata map for pod and container requests and limits. Will be used
to calculate new metrics in a future PR.

**Link to tracking Issue:** #24905

**Testing:** added unit tests
TylerHelmuth added a commit that referenced this issue Sep 29, 2023
**Description:**
Adds new `k8s.pod.memory.utilization` and `container.memory.utilization`
metrics that represent the ratio of memory used vs limits set.

The metrics are only emitted for pods/containers that have defined
resource limits. It takes advantage of the pod metadata to acquire the
container limits. A pod limit is computed as the sum of all the
container limits and if any container limit is zero or undefined the pod
limit is also considered undefined.

**Link to tracking Issue:** Related to #24905

**Testing:** Unit tests and local testing
TylerHelmuth added a commit that referenced this issue Oct 6, 2023
**Description:** 
Adds new CPU utilization metrics with respect to pod/container CPU
limits and requests

**Link to tracking Issue:** Closes #24905

**Testing:** Added new unit tests and tested locally
dmitryax pushed a commit that referenced this issue Jan 12, 2024
**Description:** 
Starts the name change process for `*.cpu.utilization` metrics.

**Link to tracking Issue:** Related to #24905 and #27885