
Sandbox pause container not deleted #113073

Closed
jdn5126 opened this issue Oct 14, 2022 · 11 comments
Labels
kind/bug · lifecycle/stale · sig/node · triage/accepted

jdn5126 commented Oct 14, 2022

What happened?

A pod deletion did not result in the correct sandbox pause container being deleted. This was found during internal EKS testing on 1.22, where the test case was as follows:

  1. Create a deployment with ~20 replicas and target a single Linux node (a reconstructed manifest is sketched after this list).
  2. Wait for all pods to have networking resources allocated.
  3. Delete the deployment.
  4. Validate that networking resources are cleaned up.
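
For reference, a reconstructed sketch of such a deployment (the container image and the hostname value are assumptions, not the exact manifest we used; any small image and any single-node pin should behave the same):

# Repro deployment sketch: ~20 replicas pinned to one Linux node.
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-test
spec:
  replicas: 20
  selector:
    matchLabels:
      app: deployment-test
  template:
    metadata:
      labels:
        app: deployment-test
    spec:
      # Pin every replica to a single node via required node affinity.
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                - ip-192-168-14-22.us-west-2.compute.internal
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.5   # placeholder image; the workload does not matter for the repro
EOF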

What is unique about this test is that each pod sandbox fails to get an IP address assigned on average ~100 times before succeeding, so kubelet keeps trying to set up the pod sandbox, using a new container ID each time. In the failure scenario, when kubelet finally sees the CNI succeed and the pod starts, it appears that kubelet is tracking an old container ID for the pod's sandbox.

In the attached logs, the pod in question is deployment-test-6655f9df8c-jc8gv, and the sandbox create succeeds with container ID 49b6b37d511bcf993e0757afe45f55ee9d1ed2246834e055ba7f32fe7b20878e, but kubelet appears to be tracking an old container ID, c0fe93feba2615c06cac681f00b81d5e3a45ddab47714fbdb8bfc0d38253b720.

When the pod is deleted, we never see a delete issued for the successful container ID, only for the old container ID, so the pod is cleaned up while the sandbox (pause) container remains running.
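
For anyone looking at the node directly, the leak can be confirmed by listing the sandbox (pause) containers and matching them against the two IDs above (assuming the usual dockershim labels):

# Dockershim labels its sandbox containers with io.kubernetes.docker.type=podsandbox.
docker ps --filter "label=io.kubernetes.docker.type=podsandbox" \
  --format '{{.ID}}  {{.Label "io.kubernetes.pod.name"}}  {{.Status}}'

# After the pod is deleted, the successful sandbox ID is still running:
docker inspect 49b6b37d511b --format '{{.State.Status}}'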

eks_i-04f3b29e38d6f7a6e_2022-10-13_0346-UTC_0.7.1.tar.gz

What did you expect to happen?

I expect pod deletion to delete all of the pod's sandbox containers. In this case, I expected a delete for 49b6b37d511bcf993e0757afe45f55ee9d1ed2246834e055ba7f32fe7b20878e to be issued.

How can we reproduce it (as minimally and precisely as possible)?

This is the challenging part. The steps we used to reproduce this with a 1/3 success rate were:

  1. Create a deployment with ~20 replicas and set node affinity to a single Linux node.
  2. Configure the CNI such that pods fail to get IP addresses for some time while the CNI warms up.
  3. Delete the deployment as soon as all pods are running successfully (see the sketch below).
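
Roughly, step 3 amounts to (a sketch, assuming the deployment name from the attached logs):

# Wait until every replica is up, then delete the deployment immediately.
kubectl rollout status deployment/deployment-test --timeout=30m
kubectl delete deployment deployment-test
# Leaked pause containers can then be spotted on the node as shown above.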

Anything else we need to know?

So far, we have only been able to reproduce this on 1.22. We were able to reproduce with kubelet logging verbosity set to 10, but enabling any containerd debug logging made the issue no longer reproducible. From the resulting logs, we were unable to confirm whether this is a kubelet or containerd issue.
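
For anyone trying to reproduce with the same logging, verbosity can be raised with a systemd drop-in along these lines (the file name and the reliance on $KUBELET_EXTRA_ARGS are assumptions about the EKS AL2 AMI; adjust for your node setup):

# Bump kubelet log verbosity to --v=10 on the node (sketch).
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo tee /etc/systemd/system/kubelet.service.d/90-verbosity.conf <<'EOF'
[Service]
Environment="KUBELET_EXTRA_ARGS=--v=10"
EOF
sudo systemctl daemon-reload
sudo systemctl restart kubelet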

Also, this issue does seem similar to #110181.

Kubernetes version

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23+", GitVersion:"v1.23.7-eks-4721010", GitCommit:"b77d9473a02fbfa834afa67d677fd12d690b195f", GitTreeState:"clean", BuildDate:"2022-06-27T22:22:16Z", GoVersion:"go1.17.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.13-eks-15b7512", GitCommit:"94138dfbea757d7aaf3b205419578ef186dd5efb", GitTreeState:"clean", BuildDate:"2022-08-31T19:15:48Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

Cloud provider

EKS

OS version

$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
$ uname -a
Linux ip-192-168-14-22.us-west-2.compute.internal 5.4.209-116.367.amzn2.x86_64 #1 SMP Wed Aug 31 00:09:52 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Install tools

eksctl

Container runtime (CRI) and version (if applicable)

docker://20.10.17

Related plugins (CNI, CSI, ...) and versions (if applicable)

[CNI](https://github.com/aws/amazon-vpc-cni-k8s) version 1.11.4
jdn5126 added the kind/bug label Oct 14, 2022
k8s-ci-robot added the needs-sig and needs-triage labels Oct 14, 2022

jdn5126 commented Oct 14, 2022

/sig node

k8s-ci-robot added the sig/node label and removed the needs-sig label Oct 14, 2022

BenTheElder commented Oct 18, 2022

> Container runtime (CRI) and version (if applicable)
> docker://20.10.17

Is this dockershim?

It's worth noting that dockershim was removed from kubelet in 1.24, and 1.22 is end of life in just 11 days (October 28th), with 1.25 being current and 1.26 on the way:

https://kubernetes.io/releases/

(It's possible this bug may need patching in 1.23 though, which is EOL February 23rd, 2023, but few people work on dockershim at this point)


jdn5126 commented Oct 18, 2022

> Container runtime (CRI) and version (if applicable)
> docker://20.10.17
>
> Is this dockershim?
>
> It's worth noting that dockershim was removed from kubelet in 1.24, and 1.22 is end of life in just 11 days (October 28th), with 1.25 being current and 1.26 on the way:
>
> https://kubernetes.io/releases/
>
> (It's possible this bug may need patching in 1.23 though, which is EOL February 23rd, 2023, but few people work on dockershim at this point)

Yes, this is dockershim, and I believe the issue exists in 1.23 as well, though I have not been able to reproduce it there. I was hoping the logs could point to whether the issue has anything to do with dockershim or not.

@BenTheElder

IIRC: in CRI, the kubelet asks for the pod sandbox to be removed (which includes the pause container), and the pause container is no longer a detail of the kubelet but a detail of the CRI implementation.
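
For illustration, those sandbox-level operations map onto the pod-sandbox commands crictl exposes (the dockershim socket path below is the usual default and is an assumption about this node):

# Pod-sandbox operations at the CRI level, driven by hand via crictl.
crictl --runtime-endpoint unix:///var/run/dockershim.sock pods                 # ListPodSandbox
crictl --runtime-endpoint unix:///var/run/dockershim.sock stopp <sandbox-id>   # StopPodSandbox
crictl --runtime-endpoint unix:///var/run/dockershim.sock rmp <sandbox-id>     # RemovePodSandbox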


jdn5126 commented Oct 18, 2022

Yeah, so what I was unable to tell from the logs is whether kubelet was requesting the correct pod sandbox to be removed, or requesting the wrong one, leading to the container persisting.
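
For anyone digging into the attached bundle, searching for both sandbox IDs is the quickest way to see which one the teardown was issued against (the kubelet.log file name is an assumption about the bundle layout):

# Old/tracked ID: c0fe93feba26...; new/successful ID: 49b6b37d511b...
grep -E 'c0fe93feba26|49b6b37d511b' kubelet.log
# Or live on the node:
journalctl -u kubelet --no-pager | grep -E 'c0fe93feba26|49b6b37d511b'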

@SergeyKanzhelev

/triage accepted

Even though triage is accepted, we may have a hard time finding somebody to look into dockershim issues. Is it possible to switch to containerd as the runtime?
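
On the EKS-optimized Amazon Linux 2 AMI, that switch is normally just a node bootstrap flag (a sketch; the flag and the cluster name are assumptions about the environment):

# Bootstrap the node with containerd instead of dockershim.
/etc/eks/bootstrap.sh my-cluster --container-runtime containerd   # "my-cluster" is a placeholder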

k8s-ci-robot added the triage/accepted label and removed the needs-triage label Oct 19, 2022

jdn5126 commented Oct 19, 2022

As I understand it, that is the long-term plan for EKS, but only for 1.24+.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label Jan 17, 2023
@pankajyadav2741

We are facing the same issue with containerd in K8s v1.23.3.
We have raised a similar bug for it: #115252

@SergeyKanzhelev

/close

dockershim is out of support in OSS Kubernetes

@k8s-ci-robot

@SergeyKanzhelev: Closing this issue.

In response to this:

> /close
>
> dockershim is out of support in OSS Kubernetes

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
