No space left on device #528

Closed
edernucci opened this issue Jul 13, 2018 · 14 comments

Comments

@edernucci

Hi there,

I'm experiencing this issue, and it seems to be related to (or exactly the same as) this one:

moby/moby#29638

My machines have plenty of free disk space and healthy inode usage. Every time I hit this issue in the kubelet log, /proc/cgroups shows 2500+ cgroups in use and I have to drain and restart the node. After a reboot the cgroup count is back around 100, and after some hours (sometimes days) the error appears again.
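
For reference, this is roughly the drain-and-reboot workaround I run when a node gets into this state (the node name below is a placeholder, and the reboot happens on the node itself):

# check how many cgroups the node is using
cat /proc/cgroups
# move workloads off the node, reboot it, then put it back in rotation
kubectl drain aks-nodepool1-12345678-0 --ignore-daemonsets --delete-local-data
sudo reboot   # run on the node
kubectl uncordon aks-nodepool1-12345678-0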

@cpuguy83
Member

/cc @seanknox

@seanknox
Contributor

@edernucci can you provide some information about your cluster?

  • node count
  • kubernetes version
  • description of cluster workloads

@dsalamancaMS

@edernucci, can you provide the following outputs:

cat /proc/cgroups
docker ps | wc -l
systemd-cgls memory | grep docker-containerd-shim | grep -v | wc -l
systemd-cgls memory | grep pod | grep -v grep |grep -v kubepods | wc -l

@edernucci
Author

@seanknox sure!

  • 3 nodes
  • kubernetes 1.9.6
  • a mix of long-lived processes, small cron jobs and the occasional high-memory container (like Elasticsearch)

All namespaces have a LimitRange with CPU and memory quotas, and I sometimes find OOMKilled containers. I suspect we have a cgroup leak when a container is killed.
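
For context, the LimitRange in each namespace looks roughly like this (the name, namespace and values below are illustrative, not our exact quotas):

kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 256Mi
    max:
      cpu: "2"
      memory: 2Gi
EOF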

@edernucci
Author

@dsalamancaMS unfortunately the error only appears after some hours/days. But here is how things look right now:

:~$ cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	9	75	1
cpu	7	655	1
cpuacct	7	655	1
blkio	8	655	1
memory	11	1872	1
devices	10	655	1
freezer	2	75	1
net_cls	4	75	1
perf_event	12	75	1
net_prio	4	75	1
hugetlb	6	75	1
pids	5	656	1
rdma	3	1	1
:~$ docker ps | wc -l
48
:~$  systemd-cgls memory | grep docker-containerd-shim | grep -v | wc -l
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
0
:~$ systemd-cgls memory | grep docker-containerd-shim | wc -l
49
:~$ systemd-cgls memory | grep pod | grep -v grep |grep -v kubepods | wc -l
23
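
As a cross-check, the memory cgroup count can also be read straight from the cgroup filesystem (assuming cgroup v1, which these nodes use); each directory under the hierarchy is one cgroup, so the number should roughly match the memory line in /proc/cgroups:

find /sys/fs/cgroup/memory -type d | wc -l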

@edernucci
Author

Still rebooting nodes on a daily basis to work around this issue.

@seanknox
Contributor

@edernucci can you open a support ticket in the Azure Portal? That will get it in front of our engineering team.

@edernucci
Author

edernucci commented Aug 30, 2018

@seanknox Microsoft support stated (support ID 118081018768501) that this issue is Kubernetes- or kernel-related and is out of scope for AKS support. Please reopen the issue so we can keep track of it in the open-source ecosystem.

Regards,

@edernucci
Author

#63

@junaid-ali

junaid-ali commented Aug 18, 2019

@edernucci we're still facing this issue on AKS (version 1.13.7). It appears to be due to inotify watches reaching their limit. Whenever more than ~15 pods are created on a node (Standard D16s v3: 16 vCPUs, 64 GiB memory), we start seeing this issue. I was able to work around it by increasing the inotify watch limit from the default 8192 to 16384:

$ sudo sysctl -w fs.inotify.max_user_watches=16384 && sudo sysctl -p
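
To make the change survive a reboot, something along these lines should also work (the file name under /etc/sysctl.d is arbitrary):

# check the current limit, then persist the higher value
cat /proc/sys/fs/inotify/max_user_watches
echo 'fs.inotify.max_user_watches=16384' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system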

@mlushpenko

Take a look at this script to do it for all nodes: https://gist.github.com/brendan-rius/5ac9ec3dd7e196222c8b8b356f8973d2
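
If you prefer not to rely on the gist, a rough sketch of the same idea as a privileged DaemonSet (name, namespace and image are placeholders, not a vetted manifest):

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-inotify-limit
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: set-inotify-limit
  template:
    metadata:
      labels:
        app: set-inotify-limit
    spec:
      containers:
      - name: sysctl
        image: busybox:1.31
        securityContext:
          privileged: true
        # bump the host limit, then idle so the pod keeps running
        command: ["sh", "-c", "sysctl -w fs.inotify.max_user_watches=16384 && while true; do sleep 3600; done"]
EOF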

@edernucci
Author

Microsoft finally found the root cause of this issue: #1373

@jnoller
Contributor

jnoller commented Jan 16, 2020

@edernucci issue #1373 doesn't fix the file handle limits

@junaid-ali

@edernucci this issue should have been fixed by Azure/aks-engine#1801

@ghost locked as resolved and limited the conversation to collaborators on Aug 4, 2020