cAdvisor crashed due to OOM #2856

Open
MonicaMagoniCom opened this issue Apr 28, 2021 · 7 comments

MonicaMagoniCom commented Apr 28, 2021

We have deployed cAdvisor v0.39.0 as a DaemonSet in our Kubernetes cluster, where the nodes run version 1.14.10-gke.42.
Even though we have disabled many metrics, the cAdvisor instances keep experiencing OOM kills.

Here is our configuration:

args:
  - --disable_metrics=tcp,advtcp,udp,sched,hugetlb,cpuset,disk,accelerator,diskIO,resctrl,memory_numa,referenced_memory
  - -v=0
  - --storage_duration=0s
  - --housekeeping_interval=15s
  - --disable_root_cgroup_stats=false
  - --docker_only=true
image: gcr.io/cadvisor/cadvisor:v0.39.0
resources:
  limits:
    cpu: 2500m
    memory: 700Mi
  requests:
    cpu: 100m
    memory: 200Mi
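For context, the snippet above is only the container-level portion of the manifest; a minimal DaemonSet wrapper it could slot into might look like the sketch below. The object name, namespace, labels, and hostPath mounts are assumptions for illustration and are not taken from the issue.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cadvisor          # assumed name, not from the issue
  namespace: monitoring   # assumed namespace
spec:
  selector:
    matchLabels:
      app: cadvisor
  template:
    metadata:
      labels:
        app: cadvisor
    spec:
      containers:
        - name: cadvisor
          image: gcr.io/cadvisor/cadvisor:v0.39.0
          args:
            - --disable_metrics=tcp,advtcp,udp,sched,hugetlb,cpuset,disk,accelerator,diskIO,resctrl,memory_numa,referenced_memory
            - -v=0
            - --storage_duration=0s
            - --housekeeping_interval=15s
            - --disable_root_cgroup_stats=false
            - --docker_only=true
          resources:
            limits:
              cpu: 2500m
              memory: 700Mi
            requests:
              cpu: 100m
              memory: 200Mi
          volumeMounts:   # typical cAdvisor host mounts; assumed, not shown in the issue
            - name: rootfs
              mountPath: /rootfs
              readOnly: true
            - name: sys
              mountPath: /sys
              readOnly: true
            - name: docker
              mountPath: /var/lib/docker
              readOnly: true
      volumes:
        - name: rootfs
          hostPath:
            path: /
        - name: sys
          hostPath:
            path: /sys
        - name: docker
          hostPath:
            path: /var/lib/docker
```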

As you can see in the attached image, memory usage suddenly spikes (for no apparent reason) and the container then crashes, since the memory limit is 700Mi.


(attached image: memory-cadvisor)

iwankgb self-assigned this May 7, 2021

iwankgb commented May 7, 2021

How many containers per node are you running?


iwankgb commented May 7, 2021

I guess that #2840 might be related. Looks like we may need to bisect between 0.36 and 0.37.

MonicaMagoniCom (Author) replied:

> How many containers per node are you running?

We have just one container on each node.

MonicaMagoniCom (Author) replied:

> I guess that #2840 might be related. Looks like we may need to bisect between 0.36 and 0.37.

Why does it seem to be related? I'm running 0.39


MonicaMagoniCom commented May 12, 2021

I added the following flags:

  • --store_container_labels=false
  • --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,annotation.io.kubernetes.container.restartCount
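Putting the new flags together with the original ones, the resulting args block would look roughly like this (a sketch assembled from the flags quoted in this thread, not copied from the cluster):

```yaml
args:
  - --disable_metrics=tcp,advtcp,udp,sched,hugetlb,cpuset,disk,accelerator,diskIO,resctrl,memory_numa,referenced_memory
  - -v=0
  - --storage_duration=0s
  - --housekeeping_interval=15s
  - --disable_root_cgroup_stats=false
  - --docker_only=true
  - --store_container_labels=false
  - --whitelisted_container_labels=io.kubernetes.container.name,io.kubernetes.pod.name,io.kubernetes.pod.namespace,annotation.io.kubernetes.container.restartCount
```

Setting --store_container_labels=false keeps per-container labels out of the exported metrics, and --whitelisted_container_labels re-adds only the listed ones, which reduces label cardinality.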

As you can see in the attached image, there is no longer a sudden memory increase, but memory usage is still high and there are still restarts due to OOM. The memory growth is linear on nodes with fewer resources (and therefore less load), but it becomes critical on the bigger nodes.
The biggest nodes of the cluster have the following values:

|        | Capacity | Allocatable | Total requested |
|--------|----------|-------------|-----------------|
| CPU    | 8 CPU    | 7.91 CPU    | 6.15 CPU        |
| Memory | 31.62 GB | 27.86 GB    | 20.23 GB        |

The smallest one:

|        | Capacity | Allocatable | Total requested |
|--------|----------|-------------|-----------------|
| CPU    | 4 CPU    | 3.92 CPU    | 3.03 CPU        |
| Memory | 16.8 GB  | 13.94 GB    | 6.77 GB         |

(attached screenshot: Schermata da 2021-05-12 14-26-21)


jlange-koch commented Aug 9, 2021

We are experiencing similar behaviour on our GKE cluster.

Curiously, it only happens on nodes that use containerd as the runtime (Container-Optimised OS with Containerd (cos_containerd)).
If cAdvisor runs on nodes that use the Docker runtime (Container-Optimised OS with Docker (cos), the default), it behaves fine.

cAdvisor image: latest
Kubernetes version: 1.18.20-gke.501

EDIT:
This seems to be fixed when using version v0.40.0

wrathchild14 commented:

For me, the error was caused by the flag --storage_duration=0s, which stored the metrics data indefinitely. I set it to 5 seconds and the OOM errors disappeared.
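In manifest terms, the change described above amounts to swapping the duration value in the args list (a sketch using the 5-second value mentioned in the comment; the other flags are left as they were):

```yaml
args:
  - --storage_duration=5s   # was --storage_duration=0s
  # ...other flags unchanged
```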
