No space left on device #528

Closed
edernucci opened this issue Jul 13, 2018 · 14 comments

Comments

@edernucci

Hi there,

I'm experiencing this issue, and it seems to be related to (or exactly the same as) this one:

moby/moby#29638

My machines have plenty of free disk space and healthy inode usage. Every time I hit this issue in the kubelet log, /proc/cgroups shows 2500+ cgroups in use and I have to drain and restart the node. After a reboot the cgroup count is back around 100, and after some hours (sometimes days) the error appears again.
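
For reference, this is roughly the drain-and-reboot workaround I run when a node gets into this state (the node name below is a placeholder, and the reboot happens on the node itself):

# check how many cgroups the node is using
cat /proc/cgroups
# move workloads off the node, reboot it, then put it back in rotation
kubectl drain aks-nodepool1-12345678-0 --ignore-daemonsets --delete-local-data
sudo reboot   # run on the node
kubectl uncordon aks-nodepool1-12345678-0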

@cpuguy83
Member

/cc @seanknox

@seanknox
Contributor

@edernucci can you provide some information about your cluster?

  • node count
  • kubernetes version
  • description of cluster workloads

@dsalamancaMS

@edernucci, can you provide the following outputs:

cat /proc/cgroups
docker ps | wc -l
systemd-cgls memory | grep docker-containerd-shim | grep -v | wc -l
systemd-cgls memory | grep pod | grep -v grep |grep -v kubepods | wc -l

@edernucci
Author

@seanknox sure!

  • 3 nodes
  • kubernetes 1.9.6
  • a mix of long-lived processes, small cron jobs and the occasional high-memory container (like Elasticsearch)

All namespaces have a LimitRange with CPU and memory quotas, and I sometimes find OOMKilled containers. I suspect we have a cgroup leak when a container is killed.
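
For context, the LimitRange in each namespace looks roughly like this (the name, namespace and values below are illustrative, not our exact quotas):

kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: my-namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 256Mi
    max:
      cpu: "2"
      memory: 2Gi
EOF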

@edernucci
Author

@dsalamancaMS unfortunately the error only appears after some hours/days. But here is how things look right now:

:~$ cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	9	75	1
cpu	7	655	1
cpuacct	7	655	1
blkio	8	655	1
memory	11	1872	1
devices	10	655	1
freezer	2	75	1
net_cls	4	75	1
perf_event	12	75	1
net_prio	4	75	1
hugetlb	6	75	1
pids	5	656	1
rdma	3	1	1
:~$ docker ps | wc -l
48
:~$  systemd-cgls memory | grep docker-containerd-shim | grep -v | wc -l
Usage: grep [OPTION]... PATTERN [FILE]...
Try 'grep --help' for more information.
0
:~$ systemd-cgls memory | grep docker-containerd-shim | wc -l
49
:~$ systemd-cgls memory | grep pod | grep -v grep |grep -v kubepods | wc -l
23
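
As a cross-check, the memory cgroup count can also be read straight from the cgroup filesystem (assuming cgroup v1, which these nodes use); each directory under the hierarchy is one cgroup, so the number should roughly match the memory line in /proc/cgroups:

find /sys/fs/cgroup/memory -type d | wc -l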

@edernucci
Author

Still rebooting nodes on a daily basis to work around this issue.

@seanknox
Contributor

@edernucci can you open a support ticket in the Azure Portal? That will get it in front of our engineering team.

@edernucci
Author

edernucci commented Aug 30, 2018

@seanknox Microsoft support stated (support ID 118081018768501) that this issue is Kubernetes- or kernel-related and is out of scope for AKS support. Please reopen the issue so we can keep track of it in the open-source ecosystem.

Regards,

@edernucci
Author

#63

@junaid-ali

junaid-ali commented Aug 18, 2019

@edernucci we're still facing this issue on AKS (version 1.13.7). It appears to be due to inotify watches reaching their limit. Whenever more than ~15 pods are created on a node (Standard D16s v3: 16 vCPUs, 64 GiB memory), we start seeing this issue. I was able to work around it by increasing the inotify watch limit from the default 8192 to 16384:

$ sudo sysctl -w fs.inotify.max_user_watches=16384 && sudo sysctl -p
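
To make the change survive a reboot, something along these lines should also work (the file name under /etc/sysctl.d is arbitrary):

# check the current limit, then persist the higher value
cat /proc/sys/fs/inotify/max_user_watches
echo 'fs.inotify.max_user_watches=16384' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system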

@mlushpenko

Take a look at this script to do it for all nodes: https://gist.github.com/brendan-rius/5ac9ec3dd7e196222c8b8b356f8973d2
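
If you prefer not to rely on the gist, a rough sketch of the same idea as a privileged DaemonSet (name, namespace and image are placeholders, not a vetted manifest):

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: set-inotify-limit
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: set-inotify-limit
  template:
    metadata:
      labels:
        app: set-inotify-limit
    spec:
      containers:
      - name: sysctl
        image: busybox:1.31
        securityContext:
          privileged: true
        # bump the host limit, then idle so the pod keeps running
        command: ["sh", "-c", "sysctl -w fs.inotify.max_user_watches=16384 && while true; do sleep 3600; done"]
EOF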

@edernucci
Author

Microsoft finally found the root cause of this issue: #1373

@jnoller
Contributor

jnoller commented Jan 16, 2020

@edernucci issue #1373 doesn't fix the file handle limits

@junaid-ali

@edernucci this issue should have been fixed by Azure/aks-engine#1801

@ghost locked as resolved and limited the conversation to collaborators on Aug 4, 2020