
Sandbox pause container not deleted #113073

Closed
jdn5126 opened this issue Oct 14, 2022 · 11 comments
Labels
kind/bug · lifecycle/stale · sig/node · triage/accepted

jdn5126 commented Oct 14, 2022

What happened?

A pod deletion did not result in the correct sandbox pause container being deleted. This was found during internal EKS testing on 1.22, where the test case was as follows:

  1. Create a deployment with ~20 replicas and target a single Linux node (a reconstructed manifest is sketched after this list).
  2. Wait for all pods to have networking resources allocated.
  3. Delete the deployment.
  4. Validate that networking resources are cleaned up.
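
For reference, a reconstructed sketch of such a deployment (the container image and the hostname value are assumptions, not the exact manifest we used; any small image and any single-node pin should behave the same):

# Repro deployment sketch: ~20 replicas pinned to one Linux node.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-test
spec:
  replicas: 20
  selector:
    matchLabels:
      app: deployment-test
  template:
    metadata:
      labels:
        app: deployment-test
    spec:
      # Pin every replica to a single node via required node affinity.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-192-168-14-22.us-west-2.compute.internal
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.5   # placeholder image; the workload does not matter for the repro
EOF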

What is unique about this test is that each pod sandbox fails to get an IP address assigned on average ~100 times before succeeding, so kubelet keeps trying to set up the pod sandbox, using a new container ID each time. In the failure scenario, when kubelet finally sees the CNI succeed and the pod starts, it appears that kubelet is tracking an old container ID for the pod's sandbox.

In the attached logs, the pod in question is deployment-test-6655f9df8c-jc8gv, and the sandbox create succeeds with container ID 49b6b37d511bcf993e0757afe45f55ee9d1ed2246834e055ba7f32fe7b20878e, but kubelet appears to be tracking an old container ID, c0fe93feba2615c06cac681f00b81d5e3a45ddab47714fbdb8bfc0d38253b720.

When the pod is deleted, we never see a delete issued for the successful container ID, only for the old container ID, so the pod is cleaned up while the sandbox (pause) container remains running.
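
For anyone looking at the node directly, the leak can be confirmed by listing the sandbox (pause) containers and matching them against the two IDs above (assuming the usual dockershim labels):

# Dockershim labels its sandbox containers with io.kubernetes.docker.type=podsandbox.
docker ps --filter "label=io.kubernetes.docker.type=podsandbox" \
  --format '{{.ID}}  {{.Label "io.kubernetes.pod.name"}}  {{.Status}}'

# After the pod is deleted, the successful sandbox ID is still running:
docker inspect 49b6b37d511b --format '{{.State.Status}}'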

eks_i-04f3b29e38d6f7a6e_2022-10-13_0346-UTC_0.7.1.tar.gz

What did you expect to happen?

I expect pod deletion to delete all of the pod's sandbox containers. In this case, I expected a delete for 49b6b37d511bcf993e0757afe45f55ee9d1ed2246834e055ba7f32fe7b20878e to be issued.

How can we reproduce it (as minimally and precisely as possible)?

This is the challenging part. The steps we used to reproduce this with a 1/3 success rate were:

  1. Create a deployment with ~20 replicas and set node affinity to a single Linux node.
  2. Configure the CNI such that pods fail to get IP addresses for some time while the CNI warms up.
  3. Delete the deployment as soon as all pods are running successfully (see the sketch below).
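
Roughly, step 3 amounts to (a sketch, assuming the deployment name from the attached logs):

# Wait until every replica is up, then delete the deployment immediately.
kubectl rollout status deployment/deployment-test --timeout=30m
kubectl delete deployment deployment-test
# Leaked pause containers can then be spotted on the node as shown above.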

Anything else we need to know?

So far, we have only been able to reproduce this on 1.22. We were able to reproduce with kubelet logging verbosity set to 10, but enabling any containerd debug logging made the issue no longer reproducible. From the resulting logs, we were unable to confirm whether this is a kubelet or containerd issue.
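
For anyone trying to reproduce with the same logging, verbosity can be raised with a systemd drop-in along these lines (the file name and the reliance on $KUBELET_EXTRA_ARGS are assumptions about the EKS AL2 AMI; adjust for your node setup):

# Bump kubelet log verbosity to --v=10 on the node (sketch).
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo tee /etc/systemd/system/kubelet.service.d/90-verbosity.conf <<'EOF'
[Service]
Environment="KUBELET_EXTRA_ARGS=--v=10"
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet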

Also, this issue does seem similar to #110181.

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.7-eks-4721010", GitCommit:"b77d9473a02fbfa834afa67d677fd12d690b195f", GitTreeState:"clean", BuildDate:"2022-06-27T22:22:16Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.13-eks-15b7512", GitCommit:"94138dfbea757d7aaf3b205419578ef186dd5efb", GitTreeState:"clean", BuildDate:"2022-08-31T19:15:48Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

EKS

OS version

$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
$ uname -a
Linux ip-192-168-14-22.us-west-2.compute.internal 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Install tools

eksctl

Container runtime (CRI) and version (if applicable)

docker://20.10.17

Related plugins (CNI, CSI, ...) and versions (if applicable)

[CNI](https://github.com/aws/amazon-vpc-cni-k8s) version 1.11.4
jdn5126 added the kind/bug label Oct 14, 2022
k8s-ci-robot added the needs-sig and needs-triage labels Oct 14, 2022

jdn5126 commented Oct 14, 2022

/sig node

k8s-ci-robot added the sig/node label and removed the needs-sig label Oct 14, 2022

BenTheElder commented Oct 18, 2022

> Container runtime (CRI) and version (if applicable)
> docker://20.10.17

Is this dockershim?

It's worth noting that dockershim was removed from kubelet in 1.24, and 1.22 is end of life in just 11 days (October 28th), with 1.25 being current and 1.26 on the way:

https://kubernetes.io/releases/

(It's possible this bug may need patching in 1.23 though, which is EOL February 23rd, 2023, but few people work on dockershim at this point)


jdn5126 commented Oct 18, 2022

> Container runtime (CRI) and version (if applicable)
> docker://20.10.17
>
> Is this dockershim?
>
> It's worth noting that dockershim was removed from kubelet in 1.24, and 1.22 is end of life in just 11 days (October 28th), with 1.25 being current and 1.26 on the way:
>
> https://kubernetes.io/releases/
>
> (It's possible this bug may need patching in 1.23 though, which is EOL February 23rd, 2023, but few people work on dockershim at this point)

Yes, this is dockershim, and I believe the issue exists in 1.23 as well, though I have not been able to reproduce it there. I was hoping the logs could point to whether the issue has anything to do with dockershim or not.

@BenTheElder

IIRC: in CRI, the kubelet asks for the pod sandbox to be removed (which includes the pause container), and the pause container is no longer a detail of the kubelet but a detail of the CRI implementation.
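
For illustration, those sandbox-level operations map onto the pod-sandbox commands crictl exposes (the dockershim socket path below is the usual default and is an assumption about this node):

# Pod-sandbox operations at the CRI level, driven by hand via crictl.
crictl --runtime-endpoint unix:///var/run/dockershim.sock pods                 # ListPodSandbox
crictl --runtime-endpoint unix:///var/run/dockershim.sock stopp <sandbox-id>   # StopPodSandbox
crictl --runtime-endpoint unix:///var/run/dockershim.sock rmp <sandbox-id>     # RemovePodSandbox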


jdn5126 commented Oct 18, 2022

Yeah, so what I was unable to tell from the logs is whether kubelet was requesting the correct pod sandbox to be removed, or requesting the wrong one, leading to the container persisting.
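
For anyone digging into the attached bundle, searching for both sandbox IDs is the quickest way to see which one the teardown was issued against (the kubelet.log file name is an assumption about the bundle layout):

# Old/tracked ID: c0fe93feba26...; new/successful ID: 49b6b37d511b...
grep -E 'c0fe93feba26|49b6b37d511b' kubelet.log
# Or live on the node:
journalctl -u kubelet --no-pager | grep -E 'c0fe93feba26|49b6b37d511b'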

@SergeyKanzhelev

/triage accepted

Even though triage is accepted, we may have a hard time finding somebody to look into dockershim issues. Is it possible to switch to containerd as the runtime?
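
On the EKS-optimized Amazon Linux 2 AMI, that switch is normally just a node bootstrap flag (a sketch; the flag and the cluster name are assumptions about the environment):

# Bootstrap the node with containerd instead of dockershim.
/etc/eks/bootstrap.sh my-cluster --container-runtime containerd   # "my-cluster" is a placeholder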

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Oct 19, 2022

jdn5126 commented Oct 19, 2022

As I understand it, that is the long-term plan for EKS, but only for 1.24+.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 17, 2023
@pankajyadav2741

We are facing the same issue with containerd in K8s v1.23.3.
We have raised a similar bug for it: #115252

@SergeyKanzhelev

/close

dockershim is out of support in OSS Kubernetes

@k8s-ci-robot

@SergeyKanzhelev: Closing this issue.

In response to this:

> /close
>
> dockershim is out of support in OSS Kubernetes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
