Container Restarts on 1.6.1 (EKS 1.15) #1054

Closed
InAnimaTe opened this issue Jun 24, 2020 · 5 comments

@InAnimaTe

InAnimaTe commented Jun 24, 2020

I'm seeing two types of errors from readiness and liveness probes in our Kubernetes event stream (provided by Datadog):

  1. Unhealthy: Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "read init-p: connection reset by peer": unknown
  2. Unhealthy: Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown

This Pod, aws-node-wfx2t, is currently running and shows 8 restarts. Here's its Describe output with information on the Last State:

Name:                 aws-node-wfx2t
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-192-168-41-79.ec2.internal/192.168.41.79
Start Time:           Thu, 07 May 2020 12:43:13 -0400
Labels:               controller-revision-hash=6d67c8dffc
                      k8s-app=aws-node
                      pod-template-generation=4
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   192.168.41.79
IPs:                  <none>
Controlled By:        DaemonSet/aws-node
Containers:
  aws-node:
    Container ID:   docker://8904a66e2f32732f1507c2f002a55efb75581a87bf6cdf1f8ecb1adb9c82dfd1
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.6.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:d50a182475c5ee6c18c3b81b01aa649367f30fb0dc60f7a619dcdbf45e10b3a3
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Running
      Started:      Sat, 13 Jun 2020 11:09:24 -0400
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      ttrpc: client shutting down: ttrpc: closed: unknown
      Exit Code:    128
      Started:      Sat, 13 Jun 2020 11:07:52 -0400
      Finished:     Sat, 13 Jun 2020 11:07:52 -0400
    Ready:          True
    Restart Count:  8
    Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:  prng
      MINIMUM_IP_TARGET:              15
      WARM_IP_TARGET:                 7
      AWS_VPC_K8S_CNI_LOGLEVEL:       DEBUG
      AWS_VPC_K8S_CNI_VETHPREFIX:     eni
      AWS_VPC_ENI_MTU:                9001
      MY_NODE_NAME:                    (v1:spec.nodeName)
...snipped...
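
For reference, the probe command from the spec above can be run by hand against the live pod to reproduce the same exec path the kubelet uses (pod name taken from the output above; adjust for your cluster):

  # Run the same health check the kubelet execs, inside the running aws-node pod.
  kubectl -n kube-system exec aws-node-wfx2t -- /app/grpc-health-probe -addr=:50051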

I was inspired to take a look at our instances after seeing #1038, although I don't think this is the same issue.

See also #1055, which is also impacting our production workloads :(

@mogren
Contributor

mogren commented Jun 25, 2020

Hi @InAnimaTe, I wonder if you could be getting throttled by EC2? That would cause the check in entrypoint.sh to eventually fail before the liveness probe does. We just merged some changes related to this, #874 and #1028. With these two changes, at least the liveness probe can easily be configured, and we can give the aws-node pods a bit more time to start. The main reasons for them to fail are EC2 calls failing (throttling or permission issues), or kube-proxy taking a long time to start up. The CNI currently relies on kube-proxy to set the correct iptables rules for the API server endpoint on the node before it can start up.
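
As a rough illustration (not the exact mechanism from those two PRs), the delay can also be bumped directly on the DaemonSet while testing; 60 seconds here is just an example value:

  # Illustrative only: raise the liveness probe's initial delay on aws-node to give
  # entrypoint.sh more time when EC2 calls or kube-proxy are slow.
  kubectl -n kube-system patch daemonset aws-node --type=json -p='[
    {"op": "replace",
     "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
     "value": 60}
  ]'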

@InAnimaTe
Author

InAnimaTe commented Jun 30, 2020

> Hi @InAnimaTe, I wonder if you could be getting throttled by EC2? That would cause the check in entrypoint.sh to eventually fail before the liveness probe does. We just merged some changes related to this, #874 and #1028. With these two changes, at least the liveness probe can easily be configured, and we can give the aws-node pods a bit more time to start. The main reasons for them to fail are EC2 calls failing (throttling or permission issues), or kube-proxy taking a long time to start up. The CNI currently relies on kube-proxy to set the correct iptables rules for the API server endpoint on the node before it can start up.

Gotcha, so you're suggesting upgrading to 1.6.3, which incorporates those changes, and then raising the initialDelaySeconds on the liveness probe?

EDIT: Actually, I suppose I can just wait until the next release that includes #1028. Let me know your thoughts (and when that might come out).

@mogren
Contributor

mogren commented Sep 4, 2020

@InAnimaTe Hey, if this was a test cluster, would you be interested in testing the latest release candidate, v1.7.2-rc1, to see if it resolves the issue you saw here? In particular, #1186 changes the start-up behavior by actually waiting for iptables to be available before updating it.
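
One way to smoke test it on a non-production cluster is to swap the image tag on the DaemonSet (the registry below is the us-east-1 one from your describe output, other regions use a different account, and this assumes the rc tag is published there; the full upgrade is normally done by applying the versioned manifest from the repo):

  # Quick smoke test on a non-production cluster; adjust registry/region as needed.
  kubectl -n kube-system set image daemonset/aws-node \
    aws-node=602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.7.2-rc1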

@mogren
Contributor

mogren commented Sep 23, 2020

@InAnimaTe Hi, have you tried with v1.7.2 or later versions? Are you still seeing this restart issue?
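
A quick way to check is to list the DaemonSet pods and look at the RESTARTS column (the k8s-app=aws-node label is from your pod's metadata above):

  # List all aws-node pods with their restart counts and node placement.
  kubectl -n kube-system get pods -l k8s-app=aws-node -o wide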

@InAnimaTe
Author

> @InAnimaTe Hi, have you tried with v1.7.2 or later versions? Are you still seeing this restart issue?

Hey @mogren, apologies for the delays. I've been on vacation and incredibly busy with some other work lately. I haven't tried 1.7.2 yet, but I plan to start testing it over the next couple of weeks. I'm going to close this issue for now, and if I see the problem again we can re-open it. I strongly suspect that #1186 and the other changes (#1028) your team has made will solve these issues for us.

Thanks so much for your continued support on these issues and for the great work making the AWS CNI better.
