cAdvisor crashed due to OOM #2856

Open
MonicaMagoniCom opened this issue Apr 28, 2021 · 7 comments

MonicaMagoniCom commented Apr 28, 2021

We have deployed cAdvisor v0.39.0 as a DaemonSet in our Kubernetes cluster, where the nodes run version 1.14.10-gke.42.
Even though we have disabled many metrics, the cAdvisor instances keep experiencing OOM kills.

Here is our configuration:

args:
  - --disable_metrics=tcp,advtcp,udp,sched,hugetlb,cpuset,disk,accelerator,diskIO,resctrl,memory_numa,referenced_memory
  - -v=0
  - --storage_duration=0s
  - --housekeeping_interval=15s
  - --disable_root_cgroup_stats=false
  - --docker_only=true
image: gcr.io/cadvisor/cadvisor:v0.39.0
resources:
  limits:
    cpu: 2500m
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 200Mi
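For context, the snippet above is only the container-level portion of the manifest; a minimal DaemonSet wrapper it could slot into might look like the sketch below. The object name, namespace, labels, and hostPath mounts are assumptions for illustration and are not taken from the issue.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor          # assumed name, not from the issue
  namespace: monitoring   # assumed namespace
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
        - name: cadvisor
          image: gcr.io/cadvisor/cadvisor:v0.39.0
          args:
            - --disable_metrics=tcp,advtcp,udp,sched,hugetlb,cpuset,disk,accelerator,diskIO,resctrl,memory_numa,referenced_memory
            - -v=0
            - --storage_duration=0s
            - --housekeeping_interval=15s
            - --disable_root_cgroup_stats=false
            - --docker_only=true
          resources:
            limits:
              cpu: 2500m
              memory: 700Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:   # typical cAdvisor host mounts; assumed, not shown in the issue
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: docker
              mountPath: /var/lib/docker
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /
        - name: sys
          hostPath:
            path: /sys
        - name: docker
          hostPath:
            path: /var/lib/docker
```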

As you can see in the attached image, memory usage suddenly spikes (for no apparent reason) and the container then crashes, since the memory limit is 700Mi.


(attached image: memory-cadvisor)

iwankgb self-assigned this May 7, 2021

iwankgb commented May 7, 2021

How many containers per node are you running?


iwankgb commented May 7, 2021

I guess that #2840 might be related. Looks like we may need to bisect between 0.36 and 0.37.

MonicaMagoniCom (Author) replied:

> How many containers per node are you running?

We have just one container on each node.

MonicaMagoniCom (Author) replied:

> I guess that #2840 might be related. Looks like we may need to bisect between 0.36 and 0.37.

Why does it seem to be related? I'm running 0.39


MonicaMagoniCom commented May 12, 2021

I added the following flags:

  • --store_container_labels=false
  • --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,annotation.io.kubernetes.container.restartCount
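Putting the new flags together with the original ones, the resulting args block would look roughly like this (a sketch assembled from the flags quoted in this thread, not copied from the cluster):

```yaml
args:
  - --disable_metrics=tcp,advtcp,udp,sched,hugetlb,cpuset,disk,accelerator,diskIO,resctrl,memory_numa,referenced_memory
  - -v=0
  - --storage_duration=0s
  - --housekeeping_interval=15s
  - --disable_root_cgroup_stats=false
  - --docker_only=true
  - --store_container_labels=false
  - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,annotation.io.kubernetes.container.restartCount
```

Setting --store_container_labels=false keeps per-container labels out of the exported metrics, and --whitelisted_container_labels re-adds only the listed ones, which reduces label cardinality.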

As you can see in the attached image, there is no longer a sudden memory increase, but memory usage is still high and there are still restarts due to OOM. The memory growth is linear on nodes with fewer resources (and therefore less load), but it becomes critical on the bigger nodes.
The biggest nodes of the cluster have the following values:

|        | Capacity | Allocatable | Total requested |
|--------|----------|-------------|-----------------|
| CPU    | 8 CPU    | 7.91 CPU    | 6.15 CPU        |
| Memory | 31.62 GB | 27.86 GB    | 20.23 GB        |

The smallest one:

|        | Capacity | Allocatable | Total requested |
|--------|----------|-------------|-----------------|
| CPU    | 4 CPU    | 3.92 CPU    | 3.03 CPU        |
| Memory | 16.8 GB  | 13.94 GB    | 6.77 GB         |

(attached screenshot: Schermata da 2021-05-12 14-26-21)


jlange-koch commented Aug 9, 2021

We are experiencing similar behaviour on our GKE cluster.

Curiously, it only happens on nodes that use containerd as the runtime (Container-Optimised OS with Containerd (cos_containerd)).
If cAdvisor runs on nodes that use the Docker runtime (Container-Optimised OS with Docker (cos), the default), it behaves fine.

cAdvisor image: latest
Kubernetes version: 1.18.20-gke.501

EDIT:
This seems to be fixed when using version v0.40.0

wrathchild14 commented:

For me, the error was caused by the flag --storage_duration=0s, which stored the metrics data indefinitely. I set it to 5 seconds and the OOM errors disappeared.
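In manifest terms, the change described above amounts to swapping the duration value in the args list (a sketch using the 5-second value mentioned in the comment; the other flags are left as they were):

```yaml
args:
  - --storage_duration=5s   # was --storage_duration=0s
  # ...other flags unchanged
```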
