
Pod stuck in CrashLoopBackOff after restarting the pod #1422

Closed
mattalberts opened this issue Apr 8, 2021 · 4 comments

@mattalberts

What happened:
An update to the pod log level was released, changing DEBUG to INFO for both:

  • AWS_VPC_K8S_CNI_LOGLEVEL
  • AWS_VPC_K8S_PLUGIN_LOG_LEVEL

On apply, the pod became stuck in CrashLoopBackOff. The change was reverted, restoring the DEBUG level, but the pod remained stuck in CrashLoopBackOff. This section asks me to describe what happened; it's certainly not related to the log-level change. I suspect the ENI selected at startup is incorrect?

{"level":"info","ts":"2021-04-08T20:15:17.105Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-04-08T20:15:17.119Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-04-08T20:15:17.120Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2021-04-08T20:15:17.142Z","caller":"aws-k8s-agent/main.go:28","msg":"Starting L-IPAMD v1.7.5  ..."}
{"level":"info","ts":"2021-04-08T20:15:17.186Z","caller":"aws-k8s-agent/main.go:41","msg":"Testing communication with server"}
{"level":"info","ts":"2021-04-08T20:15:17.187Z","caller":"aws-k8s-agent/main.go:41","msg":"Successful communication with the Cluster! Cluster Version is: v1.16+. git version: v1.16.15-eks-ad4801. git tree state: clean. commit: ad4801fd44fe0f125c8d13f1b1d4827e8884476d. platform: linux/amd64"}
{"level":"debug","ts":"2021-04-08T20:15:17.189Z","caller":"ipamd/ipamd.go:292","msg":"Discovered region: ca-central-1"}
{"level":"info","ts":"2021-04-08T20:15:17.189Z","caller":"ipamd/ipamd.go:292","msg":"Custom networking false"}
{"level":"debug","ts":"2021-04-08T20:15:17.190Z","caller":"awsutils/awsutils.go:313","msg":"Found availability zone: ca-central-1a "}
{"level":"debug","ts":"2021-04-08T20:15:17.190Z","caller":"awsutils/awsutils.go:313","msg":"Discovered the instance primary ip address: 10.3.99.93"}
{"level":"debug","ts":"2021-04-08T20:15:17.191Z","caller":"awsutils/awsutils.go:313","msg":"Found instance-id: i-04f6a4021955bebb4 "}
{"level":"debug","ts":"2021-04-08T20:15:17.191Z","caller":"awsutils/awsutils.go:313","msg":"Found instance-type: c5.12xlarge "}
{"level":"debug","ts":"2021-04-08T20:15:17.192Z","caller":"awsutils/awsutils.go:313","msg":"Found primary interface's MAC address: 02:99:a5:52:66:02"}
{"level":"debug","ts":"2021-04-08T20:15:17.192Z","caller":"awsutils/awsutils.go:373","msg":"Discovered 2 interfaces."}
{"level":"debug","ts":"2021-04-08T20:15:17.193Z","caller":"awsutils/awsutils.go:373","msg":"Found device-number: 1 "}
{"level":"debug","ts":"2021-04-08T20:15:17.194Z","caller":"awsutils/awsutils.go:373","msg":"Found eni: eni-008f2e973d4be3be9 "}
{"level":"debug","ts":"2021-04-08T20:15:17.195Z","caller":"awsutils/awsutils.go:373","msg":"Found device-number: 0 "}
{"level":"debug","ts":"2021-04-08T20:15:17.195Z","caller":"awsutils/awsutils.go:373","msg":"Found eni: eni-06cfa14756d771eb5 "}
{"level":"debug","ts":"2021-04-08T20:15:17.195Z","caller":"awsutils/awsutils.go:373","msg":"eni-06cfa14756d771eb5 is the primary ENI of this instance"}
{"level":"debug","ts":"2021-04-08T20:15:17.196Z","caller":"awsutils/awsutils.go:313","msg":"Found subnet-id: subnet-068bf79668b2e8c29 "}
{"level":"info","ts":"2021-04-08T20:15:17.197Z","caller":"awsutils/awsutils.go:388","msg":"Found sg-003a220735ba852b1, added to ipamd cache"}
{"level":"info","ts":"2021-04-08T20:15:17.197Z","caller":"awsutils/awsutils.go:388","msg":"Found sg-0a509324a8a136303, added to ipamd cache"}
{"level":"debug","ts":"2021-04-08T20:15:17.197Z","caller":"awsutils/awsutils.go:442","msg":"Total number of interfaces found: 2 "}
{"level":"debug","ts":"2021-04-08T20:15:17.197Z","caller":"awsutils/awsutils.go:569","msg":"Found ENI MAC address: 02:5f:aa:38:5e:86"}
{"level":"debug","ts":"2021-04-08T20:15:17.199Z","caller":"awsutils/awsutils.go:569","msg":"Found ENI: eni-008f2e973d4be3be9, MAC 02:5f:aa:38:5e:86, device 1"}
{"level":"debug","ts":"2021-04-08T20:15:17.199Z","caller":"awsutils/awsutils.go:589","msg":"Found CIDR 10.3.96.0/20 for ENI 02:5f:aa:38:5e:86"}
{"level":"debug","ts":"2021-04-08T20:15:17.200Z","caller":"awsutils/awsutils.go:589","msg":"Found IP addresses [10.3.111.153] on ENI 02:5f:aa:38:5e:86"}
{"level":"debug","ts":"2021-04-08T20:15:17.200Z","caller":"awsutils/awsutils.go:569","msg":"Found ENI MAC address: 02:99:a5:52:66:02"}
{"level":"debug","ts":"2021-04-08T20:15:17.202Z","caller":"awsutils/awsutils.go:583","msg":"Using device number 0 for primary ENI: eni-06cfa14756d771eb5"}
{"level":"debug","ts":"2021-04-08T20:15:17.202Z","caller":"awsutils/awsutils.go:569","msg":"Found ENI: eni-06cfa14756d771eb5, MAC 02:99:a5:52:66:02, device 0"}
{"level":"debug","ts":"2021-04-08T20:15:17.203Z","caller":"awsutils/awsutils.go:589","msg":"Found CIDR 10.3.96.0/20 for ENI 02:99:a5:52:66:02"}
{"level":"debug","ts":"2021-04-08T20:15:17.204Z","caller":"awsutils/awsutils.go:589","msg":"Found IP addresses [10.3.99.93 10.3.97.146 10.3.103.92 10.3.98.47 10.3.108.182 10.3.99.15 10.3.103.31 10.3.109.163 10.3.99.165 10.3.107.48 10.3.103.168 10.3.99.103 10.3.98.209 10.3.107.79 10.3.108.113 10.3.98.188 10.3.104.116 10.3.109.252 10.3.110.20 10.3.105.247 10.3.100.188 10.3.97.248 10.3.102.214 10.3.111.4 10.3.103.157 10.3.102.60] on ENI 02:99:a5:52:66:02"}
{"level":"debug","ts":"2021-04-08T20:15:17.204Z","caller":"awsutils/awsutils.go:388","msg":"Update ENI eni-008f2e973d4be3be9"}
{"level":"error","ts":"2021-04-08T20:15:17.411Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: ipamd: can not initialize with AWS SDK interface: refreshSGIDs: unable to update the ENI's SG: InvalidNetworkInterfaceID.NotFound: The networkInterface ID 'eni-008f2e973d4be3be9' does not exist\n\tstatus code: 400, request id: 37c1197a-8c1a-4ed0-bcca-5b5504bffd16"}

Notice that the NotFound interface eni-008f2e973d4be3be9 is listed as device-number: 1, while the primary ENI for the instance is eni-06cfa14756d771eb5.
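For anyone hitting this, one way to confirm the suspected mismatch (my own diagnostic sketch, not something the CNI runs) is to compare the ENIs reported by the instance metadata service against what the EC2 API actually knows about; an ENI present in IMDS but absent from describe-network-interfaces would match the stale-metadata theory:

```shell
# On the affected node: list the MACs IMDS knows about, then the ENI ID
# behind each MAC. IMDS returns each MAC entry with a trailing slash, so
# appending "interface-id" forms the per-interface path directly.
IMDS=http://169.254.169.254/latest/meta-data/network/interfaces/macs/
for mac in $(curl -s "$IMDS"); do
  echo "$mac -> $(curl -s "${IMDS}${mac}interface-id")"
done

# Cross-check against the EC2 control plane for the same instance
# (instance ID taken from the ipamd log above).
aws ec2 describe-network-interfaces \
  --filters Name=attachment.instance-id,Values=i-04f6a4021955bebb4 \
  --query 'NetworkInterfaces[].NetworkInterfaceId'
```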

Attach logs

  • the issue occurs too early in pod startup to exec into the container and capture logs

What you expected to happen:

  • the pod should restart

How to reproduce it (as minimally and precisely as possible):

  • restart the pod
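For completeness, the restart in question was just deleting the DaemonSet pod so it gets recreated; a minimal sketch, with the label and node name taken from the describe output below:

```shell
# Delete the aws-node pod on the affected node; the DaemonSet controller
# recreates it, which is what triggers the CrashLoopBackOff.
kubectl -n kube-system delete pod -l k8s-app=aws-node \
  --field-selector spec.nodeName=ip-10-3-99-93.ca-central-1.compute.internal
```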

Anything else we need to know?:

% kubectl -n kube-system describe po aws-node-shhgn
Name:                 aws-node-shhgn
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-10-3-99-93.ca-central-1.compute.internal/10.3.99.93
Start Time:           Thu, 08 Apr 2021 13:05:20 -0700
Labels:               controller-revision-hash=dc9cc4bcd
                      k8s-app=aws-node
                      pod-template-generation=6
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   10.3.99.93
IPs:
  IP:           10.3.99.93
Controlled By:  DaemonSet/aws-node
Init Containers:
  aws-vpc-cni-init:
    Container ID:   docker://a599b72e7b7e5b77c931cdef346b0dfdb86ac115562dc439e1e6e7c116cdf43b
    Image:          602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni-init@sha256:d96d712513464de6ce94e422634a25546565418f20d1b28d3bce399d578f3296
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 08 Apr 2021 13:05:20 -0700
      Finished:     Thu, 08 Apr 2021 13:05:20 -0700
    Ready:          True
    Restart Count:  0
    Environment:
      DISABLE_TCP_EARLY_DEMUX:  false
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aws-node-token-mgmnk (ro)
Containers:
  aws-node:
    Container ID:   docker://5dc44c004100abb16c3a8d5f9b260401bac8b369e513bd08e692e093b753b60c
    Image:          602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni@sha256:f310c918ee2b4ebced76d2d64a2ec128dde3b364d1b495f0ae73011f489d474d
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Thu, 08 Apr 2021 13:28:07 -0700
      Finished:     Thu, 08 Apr 2021 13:29:46 -0700
    Ready:          False
    Restart Count:  9
    Requests:
      cpu:      10m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=60s timeout=1s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=1s timeout=1s period=10s #success=1 #failure=3
    Environment:
      ADDITIONAL_ENI_TAGS:                 {}
      AWS_VPC_CNI_NODE_PORT_SUPPORT:       true
      AWS_VPC_ENI_MTU:                     9001
      AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER:  false
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:  false
      AWS_VPC_K8S_CNI_EXTERNALSNAT:        false
      AWS_VPC_K8S_CNI_LOGLEVEL:            DEBUG
      AWS_VPC_K8S_CNI_LOG_FILE:            stdout
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:       prng
      AWS_VPC_K8S_CNI_VETHPREFIX:          eni
      AWS_VPC_K8S_PLUGIN_LOG_FILE:         stderr
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL:        DEBUG
      DISABLE_INTROSPECTION:               false
      DISABLE_METRICS:                     false
      ENABLE_POD_ENI:                      false
      MY_NODE_NAME:                         (v1:spec.nodeName)
      WARM_ENI_TARGET:                     1
      WARM_IP_TARGET:                      10
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log/aws-routed-eni from log-dir (rw)
      /run/xtables.lock from xtables-lock (rw)
      /var/run/aws-node from run-dir (rw)
      /var/run/dockershim.sock from dockershim (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aws-node-token-mgmnk (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  dockershim:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/dockershim.sock
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  
  log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/aws-routed-eni
    HostPathType:  DirectoryOrCreate
  run-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/aws-node
    HostPathType:  DirectoryOrCreate
  aws-node-token-mgmnk:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aws-node-token-mgmnk
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     
                 node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/network-unavailable:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/pid-pressure:NoSchedule
                 node.kubernetes.io/unreachable:NoExecute
                 node.kubernetes.io/unschedulable:NoSchedule
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Pulling    29m                   kubelet            Pulling image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1"
  Normal   Pulled     29m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1"
  Normal   Created    29m                   kubelet            Created container aws-vpc-cni-init
  Normal   Started    29m                   kubelet            Started container aws-vpc-cni-init
  Normal   Scheduled  29m                   default-scheduler  Successfully assigned kube-system/aws-node-shhgn to ip-10-3-99-93.ca-central-1.compute.internal
  Warning  Unhealthy  29m                   kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-08T20:05:33.063Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  29m                   kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-08T20:05:43.070Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  28m                   kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-08T20:05:53.063Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  28m                   kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-08T20:06:03.072Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  28m                   kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-08T20:06:13.062Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  28m                   kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-08T20:06:23.062Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  28m                   kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-08T20:06:26.647Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  28m                   kubelet            Readiness probe failed: {"level":"info","ts":"2021-04-08T20:06:33.061Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  Unhealthy  28m                   kubelet            Liveness probe failed: {"level":"info","ts":"2021-04-08T20:06:36.643Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Normal   Killing    27m                   kubelet            Container aws-node failed liveness probe, will be restarted
  Normal   Pulled     27m (x2 over 29m)     kubelet            Successfully pulled image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
  Normal   Pulling    27m (x2 over 29m)     kubelet            Pulling image "602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
  Normal   Created    27m (x2 over 29m)     kubelet            Created container aws-node
  Normal   Started    27m (x2 over 29m)     kubelet            Started container aws-node
  Warning  Unhealthy  14m (x86 over 28m)    kubelet            (combined from similar events): Readiness probe failed: {"level":"info","ts":"2021-04-08T20:20:23.063Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
  Warning  BackOff    4m23s (x43 over 17m)  kubelet            Back-off restarting failed container

Environment:

  • Kubernetes version (use kubectl version): v1.16.15-eks-ad4801
    • Server Version: version.Info{Major:"1", Minor:"16+", GitVersion:"v1.16.15-eks-ad4801", GitCommit:"ad4801fd44fe0f125c8d13f1b1d4827e8884476d", GitTreeState:"clean", BuildDate:"2020-10-20T23:27:12Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • CNI Version: v1.7.5-eksbuild.1
    • 602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni-init:v1.7.5-eksbuild.1
    • 602401143452.dkr.ecr.ca-central-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1
@mattalberts mattalberts added the bug label Apr 8, 2021
@jayanthvn
Contributor

Yes, this seems to be because of stale EC2 metadata (#1340). PR #1341 is part of the fix: aws-node will no longer crash but will log the errors, and a counter was added to track the number of failures. We are working with the EC2 metadata team on the sync issue.

@mattalberts
Author

Thanks for the quick reply :) and information!

@jayanthvn
Contributor

Happy to help! I will close this issue and will keep #1340 and #1341 updated.

@mattalberts
Author

Sounds good!
