Duplicated metrics for restarted Pods #2844
Comments
Did you ever solve this? We're experiencing the exact same thing.
I have solved this by changing `sum(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)` to `max(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)`. Now that I think about it, it actually makes a kind of sense...
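A sketch of that change, formatted as standalone queries (the `{xxx}` label matchers and `$interval` are placeholders carried over from the comment above):

```
# Summing the rates adds the overlapping series of restarted containers together:
sum(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)

# Taking the max keeps only the largest of the overlapping series per pod:
max(rate(container_cpu_usage_seconds_total{xxx}[$interval])) by (pod)
```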
I never actually managed to find a solution; I'm just wary of this problem whenever dealing with OOM kills, so I never trust peak memory usages. But I suppose something like
I think I'm seeing something similar. When using Karpenter (AWS EKS) for auto-scaling, Karpenter will add the label
@LeoHsiao1 Just curious where you found this? I'm trying to understand the situation better and whether this is expected behavior or a bug. I tried to reproduce it locally by running a
@jtnz Hi
How should I interpret a report for the same time and the same container but with DIFFERENT names? Is the used memory the sum of these two values (915144704 + 799879168) or the maximum of the two (max(915144704, 799879168))? My guess is the maximum, and it's just a duplicate report with the same timestamp but different microseconds.
I appreciate your feedback.
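If it really is a duplicate report of the same container, taking the maximum rather than the sum avoids double counting. A minimal sketch, assuming the standard cAdvisor `name` and `id` labels and a placeholder pod selector:

```
# Collapse overlapping series that differ only in their cAdvisor-specific labels:
max without (name, id) (container_memory_working_set_bytes{pod="my-pod"})
```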
I believe I have encountered a bug where multiple values are exported for the same Pod at the same point in time when that Pod has been restarted.
I have been doing some load tests against an app in K8s and I noticed something. The Pod had a memory limit set to 1Gi, and while I was attacking the app with requests the Pod restarted a few times.
When I looked at the graphs in Grafana, it seemed like the Pods were using way over 2.6GiB of memory. That didn't make much sense, so I investigated the query, which led me to this issue.
Querying the `container_memory_working_set_bytes` metric in Prometheus I got the following:

Notice how individual starts of the Pod are recorded for the same point in time - see the `name` labels ending in `_7`, `_8`, `_9`, `_10`, for example. These are four instances of the exact same Pod, but in reality they never ran at the same time (they're restarts). If I add these values together it gives 2.6GiB, which is the number I saw in Grafana. I can confirm from other graphs that memory usage on the nodes never registered a 2.6GiB increase; they saw 1GiB, which is my limit.

I use Amazon EKS with Kubernetes version v1.19.6-eks-49a6c0, which I believe uses cAdvisor v0.37.3.
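To illustrate the difference, a sketch of the two aggregations over this metric (the pod selector is a placeholder; the label names assume the standard kubelet/cAdvisor metrics):

```
# Adds the overlapping _7, _8, _9, _10 series together, which is what produced ~2.6GiB:
sum by (pod) (container_memory_working_set_bytes{pod="my-app-xyz"})

# Keeps only the largest of the overlapping series, which stays near the 1Gi limit:
max by (pod) (container_memory_working_set_bytes{pod="my-app-xyz"})
```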