Container Restarts on 1.6.1 (EKS 1.15) #1054

Closed
InAnimaTe opened this issue Jun 24, 2020 · 5 comments

@InAnimaTe

InAnimaTe commented Jun 24, 2020

I'm seeing two types of errors from readiness and liveness probes in our Kubernetes event stream (provided by Datadog):

  1. Unhealthy: Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "read init-p: connection reset by peer": unknown
  2. Unhealthy: Liveness probe failed: OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "process_linux.go:101: executing setns process caused \"exit status 1\"": unknown

This Pod, aws-node-wfx2t, is currently running and shows 8 restarts. Here's its Describe output with information on the Last State:

Name:                 aws-node-wfx2t
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-192-168-41-79.ec2.internal/192.168.41.79
Start Time:           Thu, 07 May 2020 12:43:13 -0400
Labels:               controller-revision-hash=6d67c8dffc
                      k8s-app=aws-node
                      pod-template-generation=4
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   192.168.41.79
IPs:                  <none>
Controlled By:        DaemonSet/aws-node
Containers:
  aws-node:
    Container ID:   docker://8904a66e2f32732f1507c2f002a55efb75581a87bf6cdf1f8ecb1adb9c82dfd1
    Image:          602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.6.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni@sha256:d50a182475c5ee6c18c3b81b01aa649367f30fb0dc60f7a619dcdbf45e10b3a3
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Running
      Started:      Sat, 13 Jun 2020 11:09:24 -0400
    Last State:     Terminated
      Reason:       ContainerCannotRun
      Message:      ttrpc: client shutting down: ttrpc: closed: unknown
      Exit Code:    128
      Started:      Sat, 13 Jun 2020 11:07:52 -0400
      Finished:     Sat, 13 Jun 2020 11:07:52 -0400
    Ready:          True
    Restart Count:  8
    Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
    Environment:
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:  prng
      MINIMUM_IP_TARGET:              15
      WARM_IP_TARGET:                 7
      AWS_VPC_K8S_CNI_LOGLEVEL:       DEBUG
      AWS_VPC_K8S_CNI_VETHPREFIX:     eni
      AWS_VPC_ENI_MTU:                9001
      MY_NODE_NAME:                    (v1:spec.nodeName)
...snipped...
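
For reference, the probe command from the spec above can be run by hand against the live pod to reproduce the same exec path the kubelet uses (pod name taken from the output above; adjust for your cluster):

  # Run the same health check the kubelet execs, inside the running aws-node pod.
  kubectl -n kube-system exec aws-node-wfx2t -- /app/grpc-health-probe -addr=:50051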

I was inspired to take a look at our instances after seeing #1038, although I don't think this is the same issue.

See also #1055, which is also impacting our production workloads :(

@mogren
Contributor

mogren commented Jun 25, 2020

Hi @InAnimaTe, I wonder if you could be getting throttled by EC2? That would cause the check in entrypoint.sh to eventually fail before the liveness probe does. We just merged some changes related to this, #874 and #1028. With these two changes, at least the liveness probe can easily be configured, and we can give the aws-node pods a bit more time to start. The main reasons for them to fail are EC2 calls failing (throttling or permission issues), or kube-proxy taking a long time to start up. The CNI currently relies on kube-proxy to set the correct iptables rules for the API server endpoint on the node before it can start up.
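
As a rough illustration (not the exact mechanism from those two PRs), the delay can also be bumped directly on the DaemonSet while testing; 60 seconds here is just an example value:

  # Illustrative only: raise the liveness probe's initial delay on aws-node to give
  # entrypoint.sh more time when EC2 calls or kube-proxy are slow.
  kubectl -n kube-system patch daemonset aws-node --type=json -p='[
    {"op": "replace",
     "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds",
     "value": 60}
  ]'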

@InAnimaTe
Author

InAnimaTe commented Jun 30, 2020

> Hi @InAnimaTe, I wonder if you could be getting throttled by EC2? That would cause the check in entrypoint.sh to eventually fail before the liveness probe does. We just merged some changes related to this, #874 and #1028. With these two changes, at least the liveness probe can easily be configured, and we can give the aws-node pods a bit more time to start. The main reasons for them to fail are EC2 calls failing (throttling or permission issues), or kube-proxy taking a long time to start up. The CNI currently relies on kube-proxy to set the correct iptables rules for the API server endpoint on the node before it can start up.

Gotcha, so you're suggesting upgrading to 1.6.3, which incorporates those changes, and then raising the initialDelaySeconds on the liveness probe?

EDIT: Actually, I suppose I can just wait until the next release that includes #1028. Let me know your thoughts (and when that might come out).

@mogren
Contributor

mogren commented Sep 4, 2020

@InAnimaTe Hey, if this was a test cluster, would you be interested in testing the latest release candidate, v1.7.2-rc1, to see if it resolves the issue you saw here? In particular, #1186 changes the start-up behavior by actually waiting for iptables to be available before updating it.
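
One way to smoke test it on a non-production cluster is to swap the image tag on the DaemonSet (the registry below is the us-east-1 one from your describe output, other regions use a different account, and this assumes the rc tag is published there; the full upgrade is normally done by applying the versioned manifest from the repo):

  # Quick smoke test on a non-production cluster; adjust registry/region as needed.
  kubectl -n kube-system set image daemonset/aws-node \
    aws-node=602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.7.2-rc1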

@mogren
Contributor

mogren commented Sep 23, 2020

@InAnimaTe Hi, have you tried with v1.7.2 or later versions? Are you still seeing this restart issue?
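
A quick way to check is to list the DaemonSet pods and look at the RESTARTS column (the k8s-app=aws-node label is from your pod's metadata above):

  # List all aws-node pods with their restart counts and node placement.
  kubectl -n kube-system get pods -l k8s-app=aws-node -o wide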

@InAnimaTe
Author

> @InAnimaTe Hi, have you tried with v1.7.2 or later versions? Are you still seeing this restart issue?

Hey @mogren, apologies for the delays. I've been on vacation and incredibly busy with some other work lately. I haven't tried 1.7.2 yet, but I plan to start testing it over the next couple of weeks. I'm going to close this issue for now, and if I see the problem again we can re-open it. I strongly suspect that #1186 and the other changes (#1028) your team has made will solve these issues for us.

Thanks so much for your continued support on these issues and for the great work making the AWS CNI better.
