
Pod restarting at node startup #1049

Closed · srggavrilov opened this issue Jun 23, 2020 · 3 comments

@srggavrilov

After

  • the EKS cluster was upgraded to 1.14
  • the AWS CNI plugin was upgraded to 1.6.2
  • kube-proxy was upgraded to 1.14.9

the initial aws-node pod startup fails with:

starting IPAM daemon in background ... ok.
checking for IPAM connectivity ...  failed.
timed out waiting for IPAM daemon to start.

Subsequent runs always succeed.
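
If it helps anyone debugging the same symptom: the failed run's output can be recovered from the previous container instance with standard kubectl (the pod name below is a placeholder):

kubectl -n kube-system logs aws-node-xxxxx --previous
kubectl -n kube-system describe pod aws-node-xxxxx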

Pod state after restart:

Start Time:   Tue, 23 Jun 2020 11:15:04 +0200
State:          Running
  Started:      Tue, 23 Jun 2020 11:15:53 +0200
Last State:     Terminated
  Reason:       Error
  Exit Code:    1
  Started:      Tue, 23 Jun 2020 11:15:15 +0200
  Finished:     Tue, 23 Jun 2020 11:15:51 +0200

IPAMd logs:

*** START ***
{"level":"info","ts":"2020-06-23T09:15:15.675Z","caller":"aws-k8s-agent/main.go:30","msg":"Starting L-IPAMD v1.6.2  ..."}
{"level":"info","ts":"2020-06-23T09:15:45.676Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2020-06-23T09:15:45.683Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.14+. git version: v1.14.9-eks-f459c0. git tree state: clean. commit: f459c0672169dd35e77af56c24556530a05e9ab1. platform: linux/amd64"}
[...]
{"level":"debug","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:443","msg":"GetLocalPods start ..."}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: filebeat-core-xx9jc kube-system  12764c4b-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: node-problem-detector-ml8wl kube-system  126c0eb5-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Pod filebeat-core-xx9jc, Namespace kube-system, has no IP"}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Pod node-problem-detector-ml8wl, Namespace kube-system, has no IP"}
{"level":"warn","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"info","ts":"2020-06-23T09:15:46.050Z","caller":"ipamd/ipamd.go:387","msg":"Not able to get local pods yet (attempt 1/5): <nil>"}
{"level":"debug","ts":"2020-06-23T09:15:49.050Z","caller":"ipamd/ipamd.go:443","msg":"GetLocalPods start ..."}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: filebeat-core-xx9jc kube-system  12764c4b-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:443","msg":"K8SGetLocalPodIPs discovered local Pods: node-problem-detector-ml8wl kube-system  126c0eb5-b532-11ea-840e-025eca113516"}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Pod filebeat-core-xx9jc, Namespace kube-system, has no IP"}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Pod node-problem-detector-ml8wl, Namespace kube-system, has no IP"}
{"level":"warn","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Not all pods have an IP, trying again in 3 seconds."}
{"level":"info","ts":"2020-06-23T09:15:49.051Z","caller":"ipamd/ipamd.go:387","msg":"Not able to get local pods yet (attempt 2/5): <nil>"}

*** RESTART ***
{"level":"info","ts":"2020-06-23T09:15:53.102Z","caller":"aws-k8s-agent/main.go:30","msg":"Starting L-IPAMD v1.6.2  ..."}

DaemonSet:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: aws-node
  namespace: kube-system
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: "10%"
  selector:
    matchLabels:
      k8s-app: aws-node
  template:
    metadata:
      labels:
        k8s-app: aws-node
    spec:
      priorityClassName: system-node-critical
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "beta.kubernetes.io/os"
                    operator: In
                    values:
                      - linux
                  - key: "beta.kubernetes.io/arch"
                    operator: In
                    values:
                      - amd64
                  - key: "eks.amazonaws.com/compute-type"
                    operator: NotIn
                    values:
                      - fargate
              - matchExpressions:
                  - key: "kubernetes.io/os"
                    operator: In
                    values:
                      - linux
                  - key: "kubernetes.io/arch"
                    operator: In
                    values:
                      - amd64
                  - key: "eks.amazonaws.com/compute-type"
                    operator: NotIn
                    values:
                      - fargate
      serviceAccountName: aws-node
      hostNetwork: true
      tolerations:
        - operator: Exists
      containers:
        - image: 602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.6.2
          imagePullPolicy: Always
          ports:
            - containerPort: 61678
              name: metrics
          name: aws-node
          readinessProbe:
            exec:
              command: ["/app/grpc-health-probe", "-addr=:50051"]
            initialDelaySeconds: 120
          livenessProbe:
            exec:
              command: ["/app/grpc-health-probe", "-addr=:50051"]
            initialDelaySeconds: 120
          env:
            - name: AWS_VPC_K8S_CNI_LOGLEVEL
              value: DEBUG
            - name: AWS_VPC_K8S_CNI_VETHPREFIX
              value: eni
            - name: AWS_VPC_ENI_MTU
              value: "9001"
            - name: MY_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          resources:
            requests:
              cpu: 10m
          securityContext:
            privileged: true
          volumeMounts:
            - mountPath: /host/opt/cni/bin
              name: cni-bin-dir
            - mountPath: /host/etc/cni/net.d
              name: cni-net-dir
            - mountPath: /host/var/log
              name: log-dir
            - mountPath: /var/run/docker.sock
              name: dockersock
            - mountPath: /var/run/dockershim.sock
              name: dockershim
      volumes:
        - name: cni-bin-dir
          hostPath:
            path: /opt/cni/bin
        - name: cni-net-dir
          hostPath:
            path: /etc/cni/net.d
        - name: log-dir
          hostPath:
            path: /var/log
        - name: dockersock
          hostPath:
            path: /var/run/docker.sock
        - name: dockershim
          hostPath:
            path: /var/run/dockershim.sock
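
Both probes in the manifest above exec /app/grpc-health-probe against :50051, so the same check can be run by hand inside a running pod to see how long ipamd takes to start answering (the pod name is a placeholder):

kubectl -n kube-system exec aws-node-xxxxx -- /app/grpc-health-probe -addr=:50051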

What I've tried that didn't help:

  • increasing initialDelaySeconds to 120
  • checking for kube-proxy lag; it doesn't seem to be an issue. On the same node, kube-proxy started at:
    Started:      Tue, 23 Jun 2020 11:15:15 +0200
    (start times can be compared as shown below the list)
  • upgrading the AWS CNI plugin to 1.6.3
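
Container start times can be read directly with standard kubectl JSONPath (pod names are placeholders):

# kube-proxy's current start time on the node:
kubectl -n kube-system get pod kube-proxy-xxxxx \
  -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}'
# the failed aws-node run's start time (from the terminated last state):
kubectl -n kube-system get pod aws-node-xxxxx \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.startedAt}'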

Relates to:
#872
#865

@achevuru
Contributor

@srggavrilov Couldn't reproduce it. I went from CNI v1.5.7 to v1.6.3 and upgraded kube-proxy from 1.14.7 to 1.14.9, and I don't see any restarts. This is probably some race condition during startup on your end, and bumping up initialDelaySeconds might not help here, because the check below times out in 36 seconds. We might have to tune that timeout.

wait_for_ipam() {

We'll try to reproduce it so we can see what is contributing to the delay in this case. Also, it would be helpful if you could run 'aws-cni-support.sh' before and after the upgrade and share the output with us.
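
For context, a rough sketch of the kind of wait loop the entrypoint uses — a paraphrase that adds up to the 36 seconds mentioned above, not the verbatim script (attempt count, sleep, and probe path may differ):

# Paraphrase of an entrypoint-style wait; 12 attempts x 3s = 36s total.
wait_for_ipam() {
  for _ in $(seq 1 12); do
    if /app/grpc-health-probe -addr=:50051 >/dev/null 2>&1; then
      return 0
    fi
    sleep 3
  done
  return 1
}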

@srggavrilov
Author

@achevuru thanks for checking this. I haven't been able to reproduce it in a test environment either.

Could this be related to "Pod filebeat-core-xx9jc, Namespace kube-system, has no IP"? How are those pods supposed to get an IP before the CNI is up?

I think making this timeout configurable is a good idea in any case.

Unfortunately I'm not allowed to share the full 'aws-cni-support.sh' output from prod, because it contains sensitive data.
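
A configurable wait could be as simple as an environment override in the entrypoint, along these lines (a sketch only; IPAM_WAIT_SECONDS is a hypothetical name, and the real knob is whatever #874/#1028 below introduce):

# Sketch: derive the attempt count from a hypothetical env variable.
IPAM_WAIT_SECONDS="${IPAM_WAIT_SECONDS:-36}"
attempts=$(( IPAM_WAIT_SECONDS / 3 ))
for _ in $(seq 1 "$attempts"); do
  if /app/grpc-health-probe -addr=:50051 >/dev/null 2>&1; then
    break
  fi
  sleep 3
done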

@achevuru
Contributor

@srggavrilov Yeah, #874 and #1028 will provide the ability to configure the ipamd timeout. I'll close the issue since you're no longer running into it.
