/var/lib/containerd filling up on GPU worker nodes #76

Open
julienchastang opened this issue Feb 23, 2024 · 10 comments

@julienchastang
Contributor

cc: @ana-v-espinoza

We've been noticing that nodes associated with GPU clusters exhibit node-pressure errors. Tracking this down further, we have determined that /var/lib/containerd is filling up unexpectedly on the worker nodes. Note that the GPU image we are deploying is large (~12 GB), but that certainly cannot account for what we are seeing below. As a result, the cluster basically gets stuck in a loop: trying to download the images it needs, failing, purging some but not all containerd images, and starting over. (Note: the snippets below were not captured at exactly the same time, so there may be some discrepancy in the size of the offending directories.)

17G     /var/lib/containerd/io.containerd.content.v1.content
27G     /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
46G     /var/lib/containerd
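
(For reference, the per-directory sizes above can be reproduced with something like the following, run as root on the worker node:)

# Show usage one level deep under the containerd state directory.
sudo du -h -d 1 /var/lib/containerd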

In addition,

Filesystem      Size  Used Avail Use% Mounted on
udev             15G     0   15G   0% /dev
tmpfs           3.0G  2.5M  3.0G   1% /run
/dev/sda1        58G   57G  1.4G  98% /
tmpfs            15G     0   15G   0% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            15G     0   15G   0% /sys/fs/cgroup
/dev/sda15      105M  6.1M   99M   6% /boot/efi
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/71c9f19bc86a81430693b763b10177884fba0cdd17c0c207cdc7021108912643/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/e15de4902e59d9eb4978eba7c6501549c86c9c25e5e90c0d0b079a7f824a9003/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/eca7fc38cfb40ee5e7687478a73c20d50417063f356f33d3953ab0c713a5cbc7/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/ca036f6c4e29bf1a4800a92ae04923268f8e97cc1ff8ae304cc74b104745ce48/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/6ff63ac980e5a4edd71cf78d04126c70e891abcc44ee1de720004d8e37e7758c/shm
tmpfs           3.0G  4.0K  3.0G   1% /run/user/1000
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/83240f98f64048afe29b6e3fdc42e79aa9375a7e64a2a11320a14aee877127d3/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/22509265ec4fb6a33671583f958d2ba1033b062cf7691583835ac1ffa09accd8/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/5e512b51759053821414b565443ea9d45b5ce027aee9d6dc7195c932b05a1571/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/256211fead54eab13257f02d6d04361e4af2a9a9cf18e0a5475fe8972b4b4541/shm
shm              64M     0   64M   0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/d90c1af1c2117a1df9d414e00a3b48b53c12901b44e2d5881f6133f3e9b24992/shm
@zonca
Owner

zonca commented Feb 24, 2024

How long does it take for this problem to start occurring?
Does it only ever happen on GPU nodes?

@julienchastang
Contributor Author

How long does it take for this problem to start occurring?

It usually happens when I try to deploy a replacement docker image in secrets.yaml. After that, the node gets stuck in the loop I described earlier: purging (but not enough) and filling up again.

Does it only ever happen on GPU nodes?

Yes, though I am not necessarily sure there is a causal relationship.

To work around this, I created a 150 GB external volume and soft-linked /var/lib/containerd to that external mount point. The weird thing is that with each subsequent docker image deployment in secrets.yaml, K8s does not seem to purge the old image, so even though there is now a large volume to accommodate /var/lib/containerd, it just fills up again and is never purged. At what point do old images get purged? Can you manually purge as a stopgap measure?
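
(For reference, a possible manual stopgap on the affected worker, assuming crictl is installed alongside containerd; crictl rmi --prune removes only images not referenced by a running container:)

# Inspect the image filesystem usage reported by the CRI, then prune unused images.
sudo crictl imagefsinfo
sudo crictl rmi --prune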

@zonca
Owner

zonca commented Feb 27, 2024

It looks like kubelet should take care of this:
https://kubernetes.io/docs/concepts/architecture/garbage-collection/#container-image-lifecycle

This can be configured in kubespray with kubelet_image_gc_low_threshold: https://kubespray.io/#/docs/vars?id=other-service-variables

However, it seems this variable was only introduced in kubespray 2.22, kubernetes-sigs/kubespray#10075, while we are using 2.21.

In any case, that variable only changes the default values; the garbage-collection feature itself should already be present in kubelet.

But if I look at /var/lib/kubelet/config.yaml on the master node:

apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/ssl/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 169.254.25.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s

I do not see the HighThresholdPercent keyword (imageGCHighThresholdPercent in the kubelet config); maybe you can try to set that manually in the config and restart kubelet?
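
(A minimal sketch of that manual change, assuming the path /var/lib/kubelet/config.yaml shown above and the standard KubeletConfiguration field names; the threshold percentages below are illustrative, not recommendations:)

# Append explicit image GC thresholds (percent of disk usage) and restart kubelet.
echo 'imageGCHighThresholdPercent: 75' | sudo tee -a /var/lib/kubelet/config.yaml
echo 'imageGCLowThresholdPercent: 70' | sudo tee -a /var/lib/kubelet/config.yaml
sudo systemctl restart kubelet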

@zonca
Owner

zonca commented Mar 13, 2024

Any new insight here? Otherwise, I can set up a test and see if I can reproduce it in a simple deployment.

@julienchastang
Contributor Author

Thanks for following up. I ended up creating an external 150 GB disk via OpenStack and attaching it to the worker node to accommodate /var/lib/containerd. This is slightly tedious but works in the short term. It is a bit of a kludge, however, and probably not a good long-term solution. I think the problem is that the NVIDIA-blessed GPU base images tend to be large. If you are trying to reproduce this, I would suggest deploying the GPU JupyterHub and then deploying a subsequent docker image referred to in secrets.yaml, e.g.,

  image:
    name: "unidata/testing123"
    tag: "xxx"

and also in jupyterhub_gpu.yaml.

It is during these subsequent docker image deployments that /var/lib/containerd seems to fill up. Ultimately, I think the VM disk is simply not big enough to accommodate these large images, even if garbage collection is working as expected. Maybe some GC tweaking could help, but I have not tried that yet.
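
(Roughly, the workaround looks like the sketch below on the affected worker, assuming the attached OpenStack volume shows up as /dev/sdb; the device name and mount point are illustrative:)

# Move /var/lib/containerd onto the attached volume and symlink it back into place.
sudo mkfs.ext4 /dev/sdb
sudo mkdir -p /mnt/containerd
sudo mount /dev/sdb /mnt/containerd
sudo systemctl stop kubelet containerd
sudo rsync -a /var/lib/containerd/ /mnt/containerd/
sudo mv /var/lib/containerd /var/lib/containerd.bak
sudo ln -s /mnt/containerd /var/lib/containerd
sudo systemctl start containerd kubelet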

@zonca
Owner

zonca commented Mar 13, 2024

@julienchastang can you point me to the NVIDIA images you are using? I need one image with 3 or 4 tags, or multiple images, so I can do some testing.

@julienchastang
Contributor Author

For testing purposes, you may be able to use this Dockerfile, whose base image is:

ARG BASE_CONTAINER=nvcr.io/nvidia/tensorflow:24.01-tf2-py3

@julienchastang
Contributor Author

julienchastang commented Mar 13, 2024

Actually, here are some images as well:

Those are based on the large Nvidia base images I was talking about.

@zonca
Owner

zonca commented Nov 21, 2024

@julienchastang as you are deploying GPU instances again, please let me know if you experience this issue again and I can try to do some targeted testing.

@julienchastang
Contributor Author

OK, we will keep you up to date. I think we are moving away from the NVIDIA-based images, which already helps a lot.
