/var/lib/containerd filling up on GPU worker nodes #76
How long does it take for this problem to start occurring?
It usually happens when I try to deploy a replacement Docker image in
Yes, though I am not sure there is necessarily a causal relationship. To resolve this problem, I have created a 150 GB external mount. I then soft link
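A minimal sketch of that kind of relocation, assuming the external volume is already mounted at a hypothetical /mnt/containerd-data and that it is acceptable to briefly stop the node's container runtime:

```sh
# Workaround sketch: move containerd's data directory onto the larger external
# volume and soft-link it back into place. Paths are illustrative assumptions.
sudo systemctl stop kubelet containerd                       # quiesce the node first
sudo mv /var/lib/containerd /mnt/containerd-data/containerd  # relocate the data
sudo ln -s /mnt/containerd-data/containerd /var/lib/containerd
sudo systemctl start containerd kubelet
```

A bind mount declared in /etc/fstab would achieve the same effect without the symlink.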
It looks like this should be configurable through Kubespray. However, it seems this feature was only introduced in 2.22 (kubernetes-sigs/kubespray#10075), while we are using 2.21. Actually, that change only modifies the default values; the feature itself should already be there. But if I go and look at the kubelet configuration on the node:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
authentication:
  anonymous:
    enabled: false
  webhook:
    cacheTTL: 0s
    enabled: true
  x509:
    clientCAFile: /etc/kubernetes/ssl/ca.crt
authorization:
  mode: Webhook
  webhook:
    cacheAuthorizedTTL: 0s
    cacheUnauthorizedTTL: 0s
cgroupDriver: systemd
clusterDNS:
- 169.254.25.10
clusterDomain: cluster.local
cpuManagerReconcilePeriod: 0s
evictionPressureTransitionPeriod: 0s
fileCheckFrequency: 0s
healthzBindAddress: 127.0.0.1
healthzPort: 10248
httpCheckFrequency: 0s
imageMinimumGCAge: 0s
kind: KubeletConfiguration
logging:
  flushFrequency: 0
  options:
    json:
      infoBufferSize: "0"
  verbosity: 0
memorySwap: {}
nodeStatusReportFrequency: 0s
nodeStatusUpdateFrequency: 0s
resolvConf: /run/systemd/resolve/resolv.conf
rotateCertificates: true
runtimeRequestTimeout: 0s
shutdownGracePeriod: 0s
shutdownGracePeriodCriticalPods: 0s
staticPodPath: /etc/kubernetes/manifests
streamingConnectionIdleTimeout: 0s
syncFrequency: 0s
volumeStatsAggPeriod: 0s
```

I do not see the relevant settings there.
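For context, the settings in question are presumably the kubelet image garbage-collection thresholds; when set, they appear as top-level fields of this KubeletConfiguration. The snippet below is illustrative only, showing the upstream kubelet defaults:

```yaml
# Illustrative only: kubelet image GC fields (values are the upstream defaults).
imageGCHighThresholdPercent: 85   # image GC always runs above this disk usage
imageGCLowThresholdPercent: 80    # GC frees images until usage drops below this
imageMinimumGCAge: 2m0s           # minimum age before an unused image can be collected
```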
Any new insight here? Otherwise I can run a test and see if I can reproduce it in a simple deployment.
Thanks for following up. I ended up creating an external mount, as described above, and also in
It is during these subsequent Docker image deployments that the problem occurs.
@julienchastang can you point me to the NVIDIA images you are using? I need one image with 3 or 4 tags, or multiple images, so I can do some testing.
For testing purposes, you may be able to use this Dockerfile.
Actually, here are some images as well. Those are based on the large NVIDIA base images I was talking about.
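One way to approximate the failure mode, assuming shell access to a containerd-based worker with crictl installed (the image tags below are placeholders):

```sh
# Reproduction sketch: pull several tags of a large CUDA-based image and watch
# containerd's image filesystem grow between pulls. Tags are placeholders.
for tag in TAG1 TAG2 TAG3; do
    sudo crictl pull "docker.io/nvidia/cuda:${tag}"
    sudo crictl imagefsinfo          # image filesystem usage as seen by the runtime
    df -h /var/lib/containerd        # disk usage of the filesystem holding containerd
done
```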
@julienchastang as you are deploying GPU instances again, please let me know if you experience this issue again, and I can try to do some targeted testing.
OK, we will keep you up to date. I think we are moving away from the NVIDIA-based images, which already helps a lot.
cc: @ana-v-espinoza
We've been noticing nodes associated with GPU clusters exhibiting node pressure errors. Tracking this down further, we have determined that /var/lib/containerd is filling up unexpectedly on the worker nodes. Note that the GPU image we are deploying is large (~12 GB), but it certainly cannot account for what we are seeing below. As a result, the cluster basically just gets stuck in a loop: trying to download the images it needs, failing, purging some, but not all, containerd images, and starting over. (Note: the snippets below were not captured at exactly the same time, so there may be some discrepancy in the size of the offending directories.) In addition,
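A few commands that can help narrow down where the space is going, assuming shell access to an affected worker node (`<node-name>` is a placeholder):

```sh
# Diagnostic sketch: check which containerd subdirectories hold the space and
# whether the kubelet is reporting disk pressure on the node.
sudo du -sh /var/lib/containerd/*                          # per-subdirectory usage
sudo crictl imagefsinfo                                    # image filesystem usage
sudo crictl images                                         # images currently on the node
kubectl describe node <node-name> | grep -i -A2 pressure   # node pressure conditions
```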