/var/lib/containerd filling up on GPU worker nodes #76

Open
julienchastang opened this issue Feb 23, 2024 · 10 comments

@julienchastang
Contributor

cc: @ana-v-espinoza

We've been noticing that nodes associated with GPU clusters exhibit node-pressure errors. Tracking this down further, we have determined that /var/lib/containerd is filling up unexpectedly on the worker nodes. Note that the GPU image we are deploying is large (~12 GB), but that certainly cannot account for what we are seeing below. As a result, the cluster basically gets stuck in a loop: trying to download the images it needs, failing, purging some but not all containerd images, and starting over. (Note: the snippets below were not captured at exactly the same time, so there may be some discrepancy in the size of the offending directories.)

17G     /var/lib/containerd/io.containerd.content.v1.content
27G     /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
46G     /var/lib/containerd
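
(For reference, the per-directory sizes above can be reproduced with something like the following, run as root on the worker node:)

# Show usage one level deep under the containerd state directory.
sudo du -h -d 1 /var/lib/containerd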

In addition,

Filesystem      Size  Used Avail Use% Mounted on
udev             15G     0   15G   0% /dev
tmpfs           3.0G  2.5M  3.0G   1% /run
/dev/sda1        58G   57G  1.4G  98% /
tmpfs            15G     0   15G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/sda15      105M  6.1M   99M   6% /boot/efi
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/71c9f19bc86a81430693b763b10177884fba0cdd17c0c207cdc7021108912643/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/e15de4902e59d9eb4978eba7c6501549c86c9c25e5e90c0d0b079a7f824a9003/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/eca7fc38cfb40ee5e7687478a73c20d50417063f356f33d3953ab0c713a5cbc7/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/ca036f6c4e29bf1a4800a92ae04923268f8e97cc1ff8ae304cc74b104745ce48/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/6ff63ac980e5a4edd71cf78d04126c70e891abcc44ee1de720004d8e37e7758c/shm
tmpfs           3.0G  4.0K  3.0G   1% /run/user/1000
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/83240f98f64048afe29b6e3fdc42e79aa9375a7e64a2a11320a14aee877127d3/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/22509265ec4fb6a33671583f958d2ba1033b062cf7691583835ac1ffa09accd8/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/5e512b51759053821414b565443ea9d45b5ce027aee9d6dc7195c932b05a1571/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/256211fead54eab13257f02d6d04361e4af2a9a9cf18e0a5475fe8972b4b4541/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/d90c1af1c2117a1df9d414e00a3b48b53c12901b44e2d5881f6133f3e9b24992/shm
@zonca
Owner

zonca commented Feb 24, 2024

How long does it take for this problem to start occurring?
Does it only ever happen on GPU nodes?

@julienchastang
Contributor Author

How long does it take for this problem to start occurring?

It usually happens when I try to deploy a replacement docker image in secrets.yaml. After that, the node gets stuck in the loop I described earlier: purging (but not enough) and filling up again.

Does it only ever happen on GPU nodes?

Yes, though I am not necessarily sure there is a causal relationship.

To work around this, I created a 150 GB external volume and soft-linked /var/lib/containerd to that external mount point. The weird thing is that with each subsequent docker image deployment in secrets.yaml, K8s does not seem to purge the old image, so even though there is now a large volume to accommodate /var/lib/containerd, it just fills up again and is never purged. At what point do old images get purged? Can you manually purge as a stopgap measure?
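
(For reference, a possible manual stopgap on the affected worker, assuming crictl is installed alongside containerd; crictl rmi --prune removes only images not referenced by a running container:)

# Inspect the image filesystem usage reported by the CRI, then prune unused images.
sudo crictl imagefsinfo
sudo crictl rmi --prune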

@zonca
Owner

zonca commented Feb 27, 2024

It looks like kubelet should take care of this:
https://kubernetes.io/docs/concepts/architecture/garbage-collection/#container-image-lifecycle

This can be configured in kubespray with kubelet_image_gc_low_threshold: https://kubespray.io/#/docs/vars?id=other-service-variables

However, it seems this variable was only introduced in kubespray 2.22, kubernetes-sigs/kubespray#10075, while we are using 2.21.

In any case, that variable only changes the default values; the garbage-collection feature itself should already be present in kubelet.

But if I look at /var/lib/kubelet/config.yaml on the master node:

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/ssl/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 169.254.25.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s

I do not see the HighThresholdPercent keyword (imageGCHighThresholdPercent in the kubelet config); maybe you can try to set that manually in the config and restart kubelet?
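
(A minimal sketch of that manual change, assuming the path /var/lib/kubelet/config.yaml shown above and the standard KubeletConfiguration field names; the threshold percentages below are illustrative, not recommendations:)

# Append explicit image GC thresholds (percent of disk usage) and restart kubelet.
echo 'imageGCHighThresholdPercent: 75' | sudo tee -a /var/lib/kubelet/config.yaml
echo 'imageGCLowThresholdPercent: 70' | sudo tee -a /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet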

@zonca
Owner

zonca commented Mar 13, 2024

Any new insight here? Otherwise, I can set up a test and see if I can reproduce it in a simple deployment.

@julienchastang
Contributor Author

Thanks for following up. I ended up creating an external 150 GB disk via OpenStack and attaching it to the worker node to accommodate /var/lib/containerd. This is slightly tedious but works in the short term. It is a bit of a kludge, however, and probably not a good long-term solution. I think the problem is that the NVIDIA-blessed GPU base images tend to be large. If you are trying to reproduce this, I would suggest deploying the GPU JupyterHub and then deploying a subsequent docker image referred to in secrets.yaml, e.g.,

  image:
    name: "unidata/testing123"
    tag: "xxx"

and also in jupyterhub_gpu.yaml.

It is during these subsequent docker image deployments that /var/lib/containerd seems to fill up. Ultimately, I think the VM disk is simply not big enough to accommodate these large images, even if garbage collection is working as expected. Maybe some GC tweaking could help, but I have not tried that yet.
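
(Roughly, the workaround looks like the sketch below on the affected worker, assuming the attached OpenStack volume shows up as /dev/sdb; the device name and mount point are illustrative:)

# Move /var/lib/containerd onto the attached volume and symlink it back into place.
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/containerd
sudo mount /dev/sdb /mnt/containerd
sudo systemctl stop kubelet containerd
sudo rsync -a /var/lib/containerd/ /mnt/containerd/
sudo mv /var/lib/containerd /var/lib/containerd.bak
sudo ln -s /mnt/containerd /var/lib/containerd
sudo systemctl start containerd kubelet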

@zonca
Owner

zonca commented Mar 13, 2024

@julienchastang can you point me to the NVIDIA images you are using? I need one image with 3 or 4 tags, or multiple images, so I can do some testing.

@julienchastang
Contributor Author

For testing purposes, you may be able to use this Dockerfile, whose base image is:

ARG BASE_CONTAINER=nvcr.io/nvidia/tensorflow:24.01-tf2-py3

@julienchastang
Contributor Author

julienchastang commented Mar 13, 2024

Actually, here are some images as well:

Those are based on the large Nvidia base images I was talking about.

@zonca
Owner

zonca commented Nov 21, 2024

@julienchastang as you are deploying GPU instances again, please let me know if you experience this issue again and I can try to do some targeted testing.

@julienchastang
Contributor Author

OK, we will keep you up to date. I think we are moving away from the NVIDIA-based images, which already helps a lot.
