
Pod restarting at node startup #1049

Closed · srggavrilov opened this issue Jun 23, 2020 · 3 comments

@srggavrilov

After

  • the EKS cluster was upgraded to 1.14
  • the AWS CNI plugin was upgraded to 1.6.2
  • kube-proxy was upgraded to 1.14.9

the initial aws-node pod startup fails with:

starting IPAM daemon in background ... ok.
checking for IPAM connectivity ...  failed.
timed out waiting for IPAM daemon to start.

Subsequent runs always succeed.
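
If it helps anyone debugging the same symptom: the failed run's output can be recovered from the previous container instance with standard kubectl (the pod name below is a placeholder):

kubectl -n kube-system logs aws-node-xxxxx --previous
kubectl -n kube-system describe pod aws-node-xxxxx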

Pod state after restart:

Start Time:   Tue, 23 Jun 2020 11:15:04 +0200
State:          Running
  Started:      Tue, 23 Jun 2020 11:15:53 +0200
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Tue, 23 Jun 2020 11:15:15 +0200
  Finished:     Tue, 23 Jun 2020 11:15:51 +0200

IPAMd logs:

*** START ***
{"level":"info","ts":"2020-06-23T09:15:15.675Z","caller":"aws-k8s-agent/main.go:30","msg":"Starting L-IPAMD v1.6.2  ..."}
{"level":"info","ts":"2020-06-23T09:15:45.676Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2020-06-23T09:15:45.683Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.14+. git version: v1.14.9-eks-f459c0. git tree state: clean. commit: f459c0672169dd35e77af56c24556530a05e9ab1. platform: linux/amd64"}
[...]
{"level":"debug","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:443","msg":"GetLocalPods start ..."}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: filebeat-core-xx9jc kube-system  12764c4b-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: node-problem-detector-ml8wl kube-system  126c0eb5-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Pod filebeat-core-xx9jc, Namespace kube-system, has no IP"}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Pod node-problem-detector-ml8wl, Namespace kube-system, has no IP"}
{"level":"warn","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Not able to get local pods yet (attempt 1/5): <nil>"}
{"level":"debug","ts":"2020-06-23T09:15:49.050Z","caller":"ipamd/ipamd.go:443","msg":"GetLocalPods start ..."}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: filebeat-core-xx9jc kube-system  12764c4b-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: node-problem-detector-ml8wl kube-system  126c0eb5-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Pod filebeat-core-xx9jc, Namespace kube-system, has no IP"}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Pod node-problem-detector-ml8wl, Namespace kube-system, has no IP"}
{"level":"warn","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Not able to get local pods yet (attempt 2/5): <nil>"}

*** RESTART ***
{"level":"info","ts":"2020-06-23T09:15:53.102Z","caller":"aws-k8s-agent/main.go:30","msg":"Starting L-IPAMD v1.6.2  ..."}

DaemonSet:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: aws-node
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: "10%"
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      labels:
        k8s-app: aws-node
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "beta.kubernetes.io/os"
                    operator: In
                    values:
                      - linux
                  - key: "beta.kubernetes.io/arch"
                    operator: In
                    values:
                      - amd64
                  - key: "eks.amazonaws.com/compute-type"
                    operator: NotIn
                    values:
                      - fargate
              - matchExpressions:
                  - key: "kubernetes.io/os"
                    operator: In
                    values:
                      - linux
                  - key: "kubernetes.io/arch"
                    operator: In
                    values:
                      - amd64
                  - key: "eks.amazonaws.com/compute-type"
                    operator: NotIn
                    values:
                      - fargate
      serviceAccountName: aws-node
      hostNetwork: true
      tolerations:
        - operator: Exists
      containers:
        - image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.6.2
          imagePullPolicy: Always
          ports:
            - containerPort: 61678
              name: metrics
          name: aws-node
          readinessProbe:
            exec:
              command: ["/app/grpc-health-probe", "-addr=:50051"]
            initialDelaySeconds: 120
          livenessProbe:
            exec:
              command: ["/app/grpc-health-probe", "-addr=:50051"]
            initialDelaySeconds: 120
          env:
            - name: AWS_VPC_K8S_CNI_LOGLEVEL
              value: DEBUG
            - name: AWS_VPC_K8S_CNI_VETHPREFIX
              value: eni
            - name: AWS_VPC_ENI_MTU
              value: "9001"
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              cpu: 10m
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
            - mountPath: /host/var/log
              name: log-dir
            - mountPath: /var/run/docker.sock
              name: dockersock
            - mountPath: /var/run/dockershim.sock
              name: dockershim
      volumes:
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
        - name: log-dir
          hostPath:
            path: /var/log
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: dockershim
          hostPath:
            path: /var/run/dockershim.sock
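
Both probes in the manifest above exec /app/grpc-health-probe against :50051, so the same check can be run by hand inside a running pod to see how long ipamd takes to start answering (the pod name is a placeholder):

kubectl -n kube-system exec aws-node-xxxxx -- /app/grpc-health-probe -addr=:50051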

What I've tried that didn't help:

  • increasing initialDelaySeconds to 120
  • checking for kube-proxy lag; it doesn't seem to be an issue. On the same node, kube-proxy started at:
    Started:      Tue, 23 Jun 2020 11:15:15 +0200
    (start times can be compared as shown below the list)
  • upgrading the AWS CNI plugin to 1.6.3
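
Container start times can be read directly with standard kubectl JSONPath (pod names are placeholders):

# kube-proxy's current start time on the node:
kubectl -n kube-system get pod kube-proxy-xxxxx \
  -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}'
# the failed aws-node run's start time (from the terminated last state):
kubectl -n kube-system get pod aws-node-xxxxx \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.startedAt}'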

Relates to:
#872
#865

@achevuru
Contributor

@srggavrilov Couldn't reproduce it. I went from CNI v1.5.7 to v1.6.3 and upgraded kube-proxy from 1.14.7 to 1.14.9, and I don't see any restarts. This is probably some race condition during startup on your end, and bumping up initialDelaySeconds might not help here, because the check below times out in 36 seconds. We might have to tune that timeout.

wait_for_ipam() {

We'll try to reproduce it so we can see what is contributing to the delay in this case. Also, it would be helpful if you could run 'aws-cni-support.sh' before and after the upgrade and share the output with us.
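
For context, a rough sketch of the kind of wait loop the entrypoint uses — a paraphrase that adds up to the 36 seconds mentioned above, not the verbatim script (attempt count, sleep, and probe path may differ):

# Paraphrase of an entrypoint-style wait; 12 attempts x 3s = 36s total.
wait_for_ipam() {
  for _ in $(seq 1 12); do
    if /app/grpc-health-probe -addr=:50051 >/dev/null 2>&1; then
      return 0
    fi
    sleep 3
  done
  return 1
}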

@srggavrilov
Author

@achevuru thanks for checking this. I haven't been able to reproduce it in a test environment either.

Could this be related to "Pod filebeat-core-xx9jc, Namespace kube-system, has no IP"? How are those pods supposed to get an IP before the CNI is up?

I think making this timeout configurable is a good idea in any case.

Unfortunately I'm not allowed to share the full 'aws-cni-support.sh' output from prod, because it contains sensitive data.
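
A configurable wait could be as simple as an environment override in the entrypoint, along these lines (a sketch only; IPAM_WAIT_SECONDS is a hypothetical name, and the real knob is whatever #874/#1028 below introduce):

# Sketch: derive the attempt count from a hypothetical env variable.
IPAM_WAIT_SECONDS="${IPAM_WAIT_SECONDS:-36}"
attempts=$(( IPAM_WAIT_SECONDS / 3 ))
for _ in $(seq 1 "$attempts"); do
  if /app/grpc-health-probe -addr=:50051 >/dev/null 2>&1; then
    break
  fi
  sleep 3
done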

@achevuru
Contributor

@srggavrilov Yeah, #874 and #1028 will provide the ability to configure the ipamd timeout. I'll close the issue since you're no longer running into it.
