
aws-node amazon-k8s-cni:v1.10.2-eksbuild.1 restarts always on start #1930

Closed
matti opened this issue Mar 15, 2022 · 23 comments

matti commented Mar 15, 2022

What happened:
On a new node, the aws-node container times out waiting for IPAM-D and gets restarted. After the restart it works.

Attach logs
From the failed/terminated container run:

{"level":"info","ts":"2022-03-15T10:00:10.600Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-15T10:00:10.601Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-15T10:00:10.612Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-15T10:00:10.613Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-15T10:00:12.620Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-15T10:00:14.626Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
[... "Retrying waiting for IPAM-D" repeated every ~2s from 10:00:16 through 10:00:48 ...]
{"level":"info","ts":"2022-03-15T10:00:50.746Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

after restart:

{"level":"info","ts":"2022-03-15T10:01:48.812Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-15T10:01:48.813Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-15T10:01:48.826Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-15T10:01:48.827Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-15T10:01:50.835Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-15T10:01:52.841Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-15T10:01:52.857Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-03-15T10:01:52.860Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-03-15T10:01:52.861Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
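
The repeated "Retrying waiting for IPAM-D" lines come from a poll loop in entrypoint.sh. A minimal runnable sketch of that shape (an assumption about the script's structure, not its actual source; the real check is `/app/grpc-health-probe -addr=:50051`, and `/tmp/ipamd-ready` here is just a stand-in):

```shell
#!/bin/sh
# Simulate ipamd becoming ready after a delay, then poll for it the way
# the entrypoint appears to: check, log a retry line, sleep 2s, repeat.
rm -f /tmp/ipamd-ready

check_ipamd() {
  # stand-in for the gRPC health probe against ipamd on :50051
  [ -f /tmp/ipamd-ready ]
}

( sleep 1; touch /tmp/ipamd-ready ) &  # "ipamd" comes up ~1s later

until check_ipamd; do
  echo '{"level":"info","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}'
  sleep 2
done
echo "IPAM-D ready"
```

In the failing case above this loop never succeeds, so the container only exits when the kubelet's liveness probe kills it.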

What you expected to happen:
No restart

How to reproduce it (as minimally and precisely as possible):
Create a new eksctl cluster and add nodes. This happens on ~25% of node starts.

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.21
  • CNI Version 602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni:v1.10.2-eksbuild.1
  • OS (e.g: cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"
  • Kernel (e.g. uname -a):
Linux ip-192-168-33-91.eu-north-1.compute.internal 5.4.181-99.354.amzn2.x86_64 #1 SMP Wed Mar 2 18:50:46 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
@matti matti added the bug label Mar 15, 2022

jayanthvn commented Mar 15, 2022

Hi, can you please check if kube-proxy is up? Ref: #1078

There is a known issue where kube-proxy takes time to start. You would see this in the kube-proxy logs:

I0209 13:41:30.395787       1 feature_gate.go:243] feature gates: &{map[]}
E0209 13:41:30.988955       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:31.999074       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:34.259534       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:38.355840       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:47.022235       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:42:05.550145       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
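
The #1078 symptom can be checked mechanically by counting "Failed to retrieve node info" lines. A small self-contained illustration (the sample log is inlined here; against a cluster you would pipe `kubectl logs -n kube-system <kube-proxy-pod>` into the same grep):

```shell
#!/bin/sh
# Count the telltale node.go:125 errors in a kube-proxy log.
cat <<'EOF' > /tmp/kube-proxy-sample.log
I0209 13:41:30.395787       1 feature_gate.go:243] feature gates: &{map[]}
E0209 13:41:30.988955       1 node.go:125] Failed to retrieve node info: nodes "host" not found
E0209 13:41:31.999074       1 node.go:125] Failed to retrieve node info: nodes "host" not found
EOF
count=$(grep -c 'Failed to retrieve node info' /tmp/kube-proxy-sample.log)
echo "matches: $count"  # a non-zero count suggests the kube-proxy startup race
```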


matti commented Mar 16, 2022

@jayanthvn it is up and doesn't look like #1078

I managed to reproduce this on the second node I scaled up:

[screenshot omitted]

Here are the `kubectl describe` outputs for both aws-node and kube-proxy, plus logs (from the terminated aws-node, the running aws-node, and kube-proxy):

$ kubectl describe pod -n kube-system aws-node-p24gz
Name:                 aws-node-p24gz
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-192-168-89-1.eu-north-1.compute.internal/2a05:d016:dd0:4a02:6acb:a713:f755:ed36
Start Time:           Wed, 16 Mar 2022 09:54:53 +0200
Labels:               app.kubernetes.io/name=aws-node
                      controller-revision-hash=fcd885d64
                      k8s-app=aws-node
                      pod-template-generation=3
Annotations:          eksctl.io/restartedAt: 2022-03-15T15:08:22+02:00
                      kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   2a05:d016:dd0:4a02:6acb:a713:f755:ed36
IPs:
  IP:           2a05:d016:dd0:4a02:6acb:a713:f755:ed36
  IP:           16.170.222.246
Controlled By:  DaemonSet/aws-node
Init Containers:
  aws-vpc-cni-init:
    Container ID:   docker://e26240d2882c65d5af7453e86914d7ca37a65ed985b86a14c014fc9e159b5968
    Image:          602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni-init:v1.10.2-eksbuild.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni-init@sha256:a5b9f0654b6997a7f20395c6f5a4d9ffa925b84d042177426229b6e8a5a5f8ea
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 16 Mar 2022 09:55:04 +0200
      Finished:     Wed, 16 Mar 2022 09:55:04 +0200
    Ready:          True
    Restart Count:  0
    Environment:
      DISABLE_TCP_EARLY_DEMUX:      false
      ENABLE_IPv6:                  true
      AWS_STS_REGIONAL_ENDPOINTS:   regional
      AWS_DEFAULT_REGION:           eu-north-1
      AWS_REGION:                   eu-north-1
      AWS_ROLE_ARN:                 arn:aws:iam::254324309357:role/eksctl-test-1-addon-vpc-cni-Role1-R8JN8FC9EHA6
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jsm27 (ro)
Containers:
  aws-node:
    Container ID:   docker://7d5b09869dd4db1ed8cff6f91dff91a07d91a156f8bc7a0a693872114a43c1af
    Image:          602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni:v1.10.2-eksbuild.1
    Image ID:       docker-pullable://602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni@sha256:8c0135371030c6513e05f472dd66aaf7c722459781647150286618f054a4c782
    Port:           61678/TCP
    Host Port:      61678/TCP
    State:          Running
      Started:      Wed, 16 Mar 2022 09:56:51 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Wed, 16 Mar 2022 09:55:07 +0200
      Finished:     Wed, 16 Mar 2022 09:56:51 +0200
    Ready:          True
    Restart Count:  1
    Requests:
      cpu:      25m
    Liveness:   exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=60s timeout=10s period=10s #success=1 #failure=3
    Readiness:  exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s] delay=1s timeout=10s period=10s #success=1 #failure=3
    Environment:
      ADDITIONAL_ENI_TAGS:                    {}
      AWS_VPC_CNI_NODE_PORT_SUPPORT:          true
      AWS_VPC_ENI_MTU:                        9001
      AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER:     false
      AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG:     false
      AWS_VPC_K8S_CNI_EXTERNALSNAT:           false
      AWS_VPC_K8S_CNI_LOGLEVEL:               DEBUG
      AWS_VPC_K8S_CNI_LOG_FILE:               /host/var/log/aws-routed-eni/ipamd.log
      AWS_VPC_K8S_CNI_RANDOMIZESNAT:          prng
      AWS_VPC_K8S_CNI_VETHPREFIX:             eni
      AWS_VPC_K8S_PLUGIN_LOG_FILE:            /var/log/aws-routed-eni/plugin.log
      AWS_VPC_K8S_PLUGIN_LOG_LEVEL:           DEBUG
      DISABLE_INTROSPECTION:                  false
      DISABLE_METRICS:                        false
      DISABLE_NETWORK_RESOURCE_PROVISIONING:  false
      ENABLE_IPv4:                            false
      ENABLE_IPv6:                            true
      ENABLE_POD_ENI:                         false
      ENABLE_PREFIX_DELEGATION:               true
      MY_NODE_NAME:                            (v1:spec.nodeName)
      WARM_ENI_TARGET:                        1
      WARM_PREFIX_TARGET:                     1
      AWS_STS_REGIONAL_ENDPOINTS:             regional
      AWS_DEFAULT_REGION:                     eu-north-1
      AWS_REGION:                             eu-north-1
      AWS_ROLE_ARN:                           arn:aws:iam::254324309357:role/eksctl-test-1-addon-vpc-cni-Role1-R8JN8FC9EHA6
      AWS_WEB_IDENTITY_TOKEN_FILE:            /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /host/var/log/aws-routed-eni from log-dir (rw)
      /run/xtables.lock from xtables-lock (rw)
      /var/run/aws-node from run-dir (rw)
      /var/run/dockershim.sock from dockershim (rw)
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jsm27 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  dockershim:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/dockershim.sock
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  
  log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/aws-routed-eni
    HostPathType:  DirectoryOrCreate
  run-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/aws-node
    HostPathType:  DirectoryOrCreate
  kube-api-access-jsm27:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  3m9s               default-scheduler  Successfully assigned kube-system/aws-node-p24gz to ip-192-168-89-1.eu-north-1.compute.internal
  Normal   Pulling    3m6s               kubelet            Pulling image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni-init:v1.10.2-eksbuild.1"
  Normal   Pulled     2m58s              kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni-init:v1.10.2-eksbuild.1" in 7.981068408s
  Normal   Created    2m58s              kubelet            Created container aws-vpc-cni-init
  Normal   Started    2m58s              kubelet            Started container aws-vpc-cni-init
  Normal   Pulling    2m57s              kubelet            Pulling image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni:v1.10.2-eksbuild.1"
  Normal   Pulled     2m55s              kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni:v1.10.2-eksbuild.1" in 1.950825839s
  Normal   Created    2m55s              kubelet            Created container aws-node
  Normal   Started    2m55s              kubelet            Started container aws-node
  Warning  Unhealthy  2m49s              kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:55:13.546Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m44s              kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:55:18.612Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m37s              kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:55:25.300Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m27s              kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:55:35.255Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m17s              kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:55:45.252Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  2m7s               kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:55:55.266Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  117s               kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:56:05.315Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  107s               kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-16T07:56:15.269Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  101s               kubelet            Liveness probe failed: {"level":"info","ts":"2022-03-16T07:56:21.266Z","caller":"/usr/local/go/src/runtime/proc.go:255","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Normal   Killing    81s                kubelet            Container aws-node failed liveness probe, will be restarted
  Warning  Unhealthy  71s (x6 over 97s)  kubelet            (combined from similar events): Readiness probe failed:
  Normal   Pulled     71s                kubelet            Container image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/amazon-k8s-cni:v1.10.2-eksbuild.1" already present on machine
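
The restart timing is consistent with the liveness spec shown above (`delay=60s timeout=10s period=10s #failure=3`). A back-of-the-envelope check (assuming the kubelet kills after the third consecutive failure, with probes spaced one period apart):

```shell
#!/bin/sh
# Earliest possible kill time implied by the liveness probe settings.
delay=60; period=10; failures=3
earliest_kill=$((delay + (failures - 1) * period))
echo "earliest kill: ~${earliest_kill}s after container start"
```

That gives ~80s; with the 10s probe timeout added per failed attempt, the observed lifetime of ~104s (09:55:07 to 09:56:51) is in the expected range.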
$ kubectl describe -n kube-system pod kube-proxy-hsdm2 
Name:                 kube-proxy-hsdm2
Namespace:            kube-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-192-168-89-1.eu-north-1.compute.internal/2a05:d016:dd0:4a02:6acb:a713:f755:ed36
Start Time:           Wed, 16 Mar 2022 09:54:53 +0200
Labels:               controller-revision-hash=696598889
                      k8s-app=kube-proxy
                      pod-template-generation=1
Annotations:          kubernetes.io/psp: eks.privileged
Status:               Running
IP:                   2a05:d016:dd0:4a02:6acb:a713:f755:ed36
IPs:
  IP:           2a05:d016:dd0:4a02:6acb:a713:f755:ed36
  IP:           16.170.222.246
Controlled By:  DaemonSet/kube-proxy
Containers:
  kube-proxy:
    Container ID:  docker://16476005fc68d2e6313f48c2c87639cc1d49ebdb10278b78783708bc47291047
    Image:         602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/kube-proxy:v1.21.2-eksbuild.2
    Image ID:      docker-pullable://602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/kube-proxy@sha256:0ea6717ed144c7f04922bf56662d58d5b14b7b62ef78c70e636a02d22052681c
    Port:          <none>
    Host Port:     <none>
    Command:
      kube-proxy
      --v=2
      --config=/var/lib/kube-proxy-config/config
    State:          Running
      Started:      Wed, 16 Mar 2022 09:55:03 +0200
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:        100m
    Environment:  <none>
    Mounts:
      /lib/modules from lib-modules (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/kube-proxy-config/ from config (rw)
      /var/lib/kube-proxy/ from kubeconfig (rw)
      /var/log from varlog (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-24xkn (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  kubeconfig:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-proxy
    Optional:  false
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      kube-proxy-config
    Optional:  false
  kube-api-access-24xkn:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type    Reason     Age    From               Message
  ----    ------     ----   ----               -------
  Normal  Scheduled  4m7s   default-scheduler  Successfully assigned kube-system/kube-proxy-hsdm2 to ip-192-168-89-1.eu-north-1.compute.internal
  Normal  Pulling    4m4s   kubelet            Pulling image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/kube-proxy:v1.21.2-eksbuild.2"
  Normal  Pulled     3m59s  kubelet            Successfully pulled image "602401143452.dkr.ecr.eu-north-1.amazonaws.com/eks/kube-proxy:v1.21.2-eksbuild.2" in 5.029419457s
  Normal  Created    3m57s  kubelet            Created container kube-proxy
  Normal  Started    3m57s  kubelet            Started container kube-proxy

From the crashed aws-node:

$ kubectl logs -n kube-system aws-node-p24gz --previous=true
{"level":"info","ts":"2022-03-16T07:55:07.710Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-16T07:55:07.711Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-16T07:55:07.726Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-16T07:55:07.730Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-16T07:55:09.739Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-16T07:55:11.749Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
[... "Retrying waiting for IPAM-D" repeated every ~2s from 07:55:13 through 07:56:46 ...]
{"level":"info","ts":"2022-03-16T07:56:48.197Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-16T07:56:50.205Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

after the restart:

$ kubectl logs -n kube-system aws-node-p24gz                
{"level":"info","ts":"2022-03-16T07:56:51.522Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-16T07:56:51.523Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-16T07:56:51.540Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-16T07:56:51.546Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-16T07:56:53.556Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-16T07:56:55.564Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-16T07:56:55.584Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2022-03-16T07:56:55.595Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2022-03-16T07:56:55.596Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

And from the kube-proxy, which didn't restart:

$ kubectl logs -n kube-system kube-proxy-hsdm2 
I0316 07:55:04.065581       1 flags.go:59] FLAG: --add-dir-header="false"
I0316 07:55:04.065662       1 flags.go:59] FLAG: --alsologtostderr="false"
I0316 07:55:04.065668       1 flags.go:59] FLAG: --bind-address="0.0.0.0"
I0316 07:55:04.065675       1 flags.go:59] FLAG: --bind-address-hard-fail="false"
I0316 07:55:04.065682       1 flags.go:59] FLAG: --boot-id-file="/proc/sys/kernel/random/boot_id"
I0316 07:55:04.065687       1 flags.go:59] FLAG: --cleanup="false"
I0316 07:55:04.065691       1 flags.go:59] FLAG: --cluster-cidr=""
I0316 07:55:04.065702       1 flags.go:59] FLAG: --config="/var/lib/kube-proxy-config/config"
I0316 07:55:04.065708       1 flags.go:59] FLAG: --config-sync-period="15m0s"
I0316 07:55:04.065715       1 flags.go:59] FLAG: --conntrack-max-per-core="32768"
I0316 07:55:04.065725       1 flags.go:59] FLAG: --conntrack-min="131072"
I0316 07:55:04.065730       1 flags.go:59] FLAG: --conntrack-tcp-timeout-close-wait="1h0m0s"
I0316 07:55:04.065735       1 flags.go:59] FLAG: --conntrack-tcp-timeout-established="24h0m0s"
I0316 07:55:04.065740       1 flags.go:59] FLAG: --detect-local-mode=""
I0316 07:55:04.065754       1 flags.go:59] FLAG: --feature-gates=""
I0316 07:55:04.065762       1 flags.go:59] FLAG: --healthz-bind-address="0.0.0.0:10256"
I0316 07:55:04.065774       1 flags.go:59] FLAG: --healthz-port="10256"
I0316 07:55:04.065787       1 flags.go:59] FLAG: --help="false"
I0316 07:55:04.065795       1 flags.go:59] FLAG: --hostname-override=""
I0316 07:55:04.065800       1 flags.go:59] FLAG: --iptables-masquerade-bit="14"
I0316 07:55:04.065805       1 flags.go:59] FLAG: --iptables-min-sync-period="1s"
I0316 07:55:04.065816       1 flags.go:59] FLAG: --iptables-sync-period="30s"
I0316 07:55:04.065822       1 flags.go:59] FLAG: --ipvs-exclude-cidrs="[]"
I0316 07:55:04.065837       1 flags.go:59] FLAG: --ipvs-min-sync-period="0s"
I0316 07:55:04.065842       1 flags.go:59] FLAG: --ipvs-scheduler=""
I0316 07:55:04.065847       1 flags.go:59] FLAG: --ipvs-strict-arp="false"
I0316 07:55:04.065852       1 flags.go:59] FLAG: --ipvs-sync-period="30s"
I0316 07:55:04.065857       1 flags.go:59] FLAG: --ipvs-tcp-timeout="0s"
I0316 07:55:04.065869       1 flags.go:59] FLAG: --ipvs-tcpfin-timeout="0s"
I0316 07:55:04.065874       1 flags.go:59] FLAG: --ipvs-udp-timeout="0s"
I0316 07:55:04.065879       1 flags.go:59] FLAG: --kube-api-burst="10"
I0316 07:55:04.065884       1 flags.go:59] FLAG: --kube-api-content-type="application/vnd.kubernetes.protobuf"
I0316 07:55:04.065890       1 flags.go:59] FLAG: --kube-api-qps="5"
I0316 07:55:04.065930       1 flags.go:59] FLAG: --kubeconfig=""
I0316 07:55:04.065935       1 flags.go:59] FLAG: --log-backtrace-at=":0"
I0316 07:55:04.065949       1 flags.go:59] FLAG: --log-dir=""
I0316 07:55:04.065955       1 flags.go:59] FLAG: --log-file=""
I0316 07:55:04.065960       1 flags.go:59] FLAG: --log-file-max-size="1800"
I0316 07:55:04.065965       1 flags.go:59] FLAG: --log-flush-frequency="5s"
I0316 07:55:04.065970       1 flags.go:59] FLAG: --logtostderr="true"
I0316 07:55:04.065975       1 flags.go:59] FLAG: --machine-id-file="/etc/machine-id,/var/lib/dbus/machine-id"
I0316 07:55:04.065984       1 flags.go:59] FLAG: --masquerade-all="false"
I0316 07:55:04.065994       1 flags.go:59] FLAG: --master=""
I0316 07:55:04.065999       1 flags.go:59] FLAG: --metrics-bind-address="127.0.0.1:10249"
I0316 07:55:04.066005       1 flags.go:59] FLAG: --metrics-port="10249"
I0316 07:55:04.066010       1 flags.go:59] FLAG: --nodeport-addresses="[]"
I0316 07:55:04.066029       1 flags.go:59] FLAG: --one-output="false"
I0316 07:55:04.066034       1 flags.go:59] FLAG: --oom-score-adj="-999"
I0316 07:55:04.066039       1 flags.go:59] FLAG: --profiling="false"
I0316 07:55:04.066049       1 flags.go:59] FLAG: --proxy-mode=""
I0316 07:55:04.066060       1 flags.go:59] FLAG: --proxy-port-range=""
I0316 07:55:04.066066       1 flags.go:59] FLAG: --show-hidden-metrics-for-version=""
I0316 07:55:04.066071       1 flags.go:59] FLAG: --skip-headers="false"
I0316 07:55:04.066076       1 flags.go:59] FLAG: --skip-log-headers="false"
I0316 07:55:04.066088       1 flags.go:59] FLAG: --stderrthreshold="2"
I0316 07:55:04.066110       1 flags.go:59] FLAG: --udp-timeout="250ms"
I0316 07:55:04.066124       1 flags.go:59] FLAG: --v="2"
I0316 07:55:04.066129       1 flags.go:59] FLAG: --version="false"
I0316 07:55:04.066141       1 flags.go:59] FLAG: --vmodule=""
I0316 07:55:04.066147       1 flags.go:59] FLAG: --write-config-to=""
I0316 07:55:04.068559       1 feature_gate.go:243] feature gates: &{map[]}
I0316 07:55:04.069495       1 feature_gate.go:243] feature gates: &{map[]}
I0316 07:55:04.153349       1 node.go:172] Successfully retrieved node IP: 2a05:d016:dd0:4a02:6acb:a713:f755:ed36
I0316 07:55:04.153380       1 server_others.go:140] Detected node IP 2a05:d016:dd0:4a02:6acb:a713:f755:ed36
I0316 07:55:04.153494       1 server_others.go:177] DetectLocalMode: 'ClusterCIDR'
I0316 07:55:04.167247       1 server_others.go:206] kube-proxy running in dual-stack mode, IPv6-primary
I0316 07:55:04.167277       1 server_others.go:212] Using iptables Proxier.
I0316 07:55:04.167290       1 server_others.go:219] creating dualStackProxier for iptables.
W0316 07:55:04.167301       1 server_others.go:496] detect-local-mode set to ClusterCIDR, but no cluster CIDR defined
W0316 07:55:04.167307       1 server_others.go:545] detect-local-mode: ClusterCIDR , defaulting to no-op detect-local
I0316 07:55:04.167365       1 utils.go:375] Changed sysctl "net/ipv4/conf/all/route_localnet": 0 -> 1
I0316 07:55:04.167417       1 proxier.go:282] "using iptables mark for masquerade" ipFamily=IPv4 mark="0x00004000"
I0316 07:55:04.167458       1 proxier.go:330] "iptables sync params" ipFamily=IPv4 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I0316 07:55:04.167492       1 proxier.go:340] "iptables supports --random-fully" ipFamily=IPv4
I0316 07:55:04.167536       1 proxier.go:282] "using iptables mark for masquerade" ipFamily=IPv6 mark="0x00004000"
I0316 07:55:04.167564       1 proxier.go:330] "iptables sync params" ipFamily=IPv6 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I0316 07:55:04.167584       1 proxier.go:340] "iptables supports --random-fully" ipFamily=IPv6
I0316 07:55:04.167737       1 server.go:643] Version: v1.21.2-13+1163bd3b67234e
I0316 07:55:04.168479       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0316 07:55:04.168509       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0316 07:55:04.168586       1 mount_linux.go:192] Detected OS without systemd
I0316 07:55:04.169276       1 conntrack.go:83] Setting conntrack hashsize to 32768
I0316 07:55:04.177191       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0316 07:55:04.177246       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0316 07:55:04.177417       1 config.go:315] Starting service config controller
I0316 07:55:04.177438       1 shared_informer.go:240] Waiting for caches to sync for service config
I0316 07:55:04.177466       1 config.go:224] Starting endpoint slice config controller
I0316 07:55:04.177475       1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
W0316 07:55:04.180362       1 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
W0316 07:55:04.182312       1 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice
I0316 07:55:04.185146       1 service.go:306] Service default/kubernetes updated: 1 ports
I0316 07:55:04.185196       1 service.go:306] Service kube-system/kube-dns updated: 2 ports
I0316 07:55:04.185669       1 service.go:306] Service cluster-autoscaler-1/cluster-autoscaler-1-aws-cluster-autoscaler updated: 1 ports
I0316 07:55:04.185721       1 service.go:306] Service prometheus-1/prometheus-1-server updated: 1 ports
I0316 07:55:04.185745       1 service.go:306] Service aws-load-balancer-controller-1/aws-load-balancer-webhook-service updated: 1 ports
I0316 07:55:04.277572       1 shared_informer.go:247] Caches are synced for service config 
I0316 07:55:04.277572       1 shared_informer.go:247] Caches are synced for endpoint slice config 
I0316 07:55:04.277658       1 proxier.go:854] "Syncing iptables rules"
I0316 07:55:04.327166       1 proxier.go:824] "syncProxyRules complete" elapsed="49.524161ms"
I0316 07:55:04.327238       1 proxier.go:816] "Not syncing iptables until Services and Endpoints have been received from master"
I0316 07:55:04.327263       1 proxier.go:854] "Syncing iptables rules"
I0316 07:55:04.362116       1 proxier.go:824] "syncProxyRules complete" elapsed="34.848323ms"
I0316 07:55:04.362157       1 service.go:421] Adding new service port "default/kubernetes:https" at fdeb:b38b:1326::1:443/TCP
I0316 07:55:04.362187       1 service.go:421] Adding new service port "kube-system/kube-dns:dns-tcp" at fdeb:b38b:1326::a:53/TCP
I0316 07:55:04.362203       1 service.go:421] Adding new service port "kube-system/kube-dns:dns" at fdeb:b38b:1326::a:53/UDP
I0316 07:55:04.362220       1 service.go:421] Adding new service port "cluster-autoscaler-1/cluster-autoscaler-1-aws-cluster-autoscaler:http" at fdeb:b38b:1326::2221:8085/TCP
I0316 07:55:04.362241       1 service.go:421] Adding new service port "prometheus-1/prometheus-1-server:http" at fdeb:b38b:1326::9323:80/TCP
I0316 07:55:04.362253       1 service.go:421] Adding new service port "aws-load-balancer-controller-1/aws-load-balancer-webhook-service" at fdeb:b38b:1326::c4c0:443/TCP
I0316 07:55:04.362420       1 proxier.go:841] "Stale service" protocol="udp" svcPortName="kube-system/kube-dns:dns" clusterIP="fdeb:b38b:1326::a"
I0316 07:55:04.362445       1 proxier.go:854] "Syncing iptables rules"
I0316 07:55:04.439259       1 proxier.go:824] "syncProxyRules complete" elapsed="77.107637ms"
I0316 07:57:20.362078       1 proxier.go:854] "Syncing iptables rules"
I0316 07:57:20.386985       1 proxier.go:824] "syncProxyRules complete" elapsed="24.962893ms"
W0316 08:00:35.186329       1 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice

@matti
Author

matti commented Mar 16, 2022

and this happens not just sometimes, but almost always:

(screenshot)

@matti matti changed the title aws-node amazon-k8s-cni:v1.10.2-eksbuild.1 restarts sometimes on start aws-node amazon-k8s-cni:v1.10.2-eksbuild.1 restarts almost always on start Mar 16, 2022
@farhank3389

We're also having the same issue on one of our clusters, although for us it doesn't recover after a restart; it just keeps restarting.

@jayanthvn
Contributor

@farhank3389 - What is your instance type? Do you have prefix delegation enabled? Can you please share the error code from /var/log/aws-routed-eni/ipamd.log?

@cgchinmay
Contributor

cgchinmay commented Mar 16, 2022

@farhank3389 Could you collect logs and send them to k8s-awscni-triage@amazon.com?
What is the instance type you are using?
Could you confirm these repro steps:

  1. Spin up a cluster using eksctl with this instance type
  2. aws-node will be stuck in a restart loop

Does it auto-resolve? Were you able to resolve it?

@cgchinmay cgchinmay self-assigned this Mar 16, 2022
@iomarcovalente

@jayanthvn we are seeing the error with all instance types.
The output of that command is the following:

$ cat /var/log/aws-routed-eni/ipamd.log
{"level":"info","ts":"2022-03-16T20:46:49.519Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:46:49.520Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:48:29.863Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:48:29.863Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:48:51.627Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:48:51.628Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:50:33.232Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:50:33.233Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:52:13.246Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:52:13.250Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:53:53.246Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:53:53.247Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:55:33.226Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:55:33.227Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:57:13.209Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:57:13.210Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T20:58:53.182Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T20:58:53.183Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}

@iomarcovalente

@cgchinmay we don't use eksctl to create the cluster, as we self-manage our nodes and use Terraform to provision EKS.
We never had issues with this until today. As a matter of fact, we have three other clusters that work fine with the same configuration; only today, all new nodes that were spinning up went into this state.

@jayanthvn
Contributor

@iomarcovalente - the error indicates the CNI is not able to talk to the API server. Can you please check the security groups attached to the instances? Or you can email us (k8s-awscni-traige@amazon.com) the cluster ARN and we can do a quick check on the configs.

@iomarcovalente

iomarcovalente commented Mar 16, 2022

Is it really k8s-awscni-traige@amazon.com? Because "traige" is a typo, FYI, it should be "triage"?

@cgchinmay
Contributor

My bad, it's k8s-awscni-triage@amazon.com

@matti matti changed the title aws-node amazon-k8s-cni:v1.10.2-eksbuild.1 restarts almost always on start aws-node amazon-k8s-cni:v1.10.2-eksbuild.1 restarts always on start Mar 16, 2022
@iomarcovalente

iomarcovalente commented Mar 16, 2022

Never mind, we found the issue: kube-proxy was pointing to the wrong API address due to a bug in our code. Apologies for the noise.
Thanks for your help and availability.

@jayanthvn
Contributor

Thanks for the update @iomarcovalente

@matti
Author

matti commented Mar 16, 2022

Changed the title of this issue to "always": I'm not seeing a single node that doesn't restart; every aws-node in the cluster restarts once. Sending the cluster ARN and logs to k8s-awscni-triage@amazon.com now.

@matti
Author

matti commented Mar 18, 2022

(screenshot)

@jayanthvn doesn't this always happen for you, too?

@jayanthvn
Contributor

jayanthvn commented Mar 18, 2022

Looks like kube-proxy completed programming the routes around 21:38:37.086265:

master\"\n","stream":"stderr","time":"2022-03-16T21:38:37.050403916Z"}
{"log":"I0316 21:38:37.050046       1 proxier.go:854] \"Syncing iptables rules\"\n","stream":"stderr","time":"2022-03-16T21:38:37.050413104Z"}
{"log":"I0316 21:38:37.085996       1 proxier.go:824] \"syncProxyRules complete\" elapsed=\"35.954406ms\"\n","stream":"stderr","time":"2022-03-16T21:38:37.086530318Z"}
{"log":"I0316 21:38:37.086042       1 service.go:421] Adding new service port \"metrics-server-1/metrics-server-1:https\" at fd7e:a2a4:8fcf::ab76:443/TCP\n","stream":"stderr","time":"2022-03-16T21:38:37.086552841Z"}
{"log":"I0316 21:38:37.086063       1 service.go:421] Adding new service port \"default/kubernetes:https\" at fd7e:a2a4:8fcf::1:443/TCP\n","stream":"stderr","time":"2022-03-16T21:38:37.086557513Z"}
{"log":"I0316 21:38:37.086077       1 service.go:421] Adding new service port \"kube-system/kube-dns:dns\" at fd7e:a2a4:8fcf::a:53/UDP\n","stream":"stderr","time":"2022-03-16T21:38:37.086595056Z"}
{"log":"I0316 21:38:37.086085       1 service.go:421] Adding new service port \"kube-system/kube-dns:dns-tcp\" at fd7e:a2a4:8fcf::a:53/TCP\n","stream":"stderr","time":"2022-03-16T21:38:37.08659919Z"}
{"log":"I0316 21:38:37.086096       1 service.go:421] Adding new service port \"cluster-autoscaler-1/cluster-autoscaler-1-aws-cluster-autoscaler:http\" at fd7e:a2a4:8fcf::3302:8085/TCP\n","stream":"stderr","time":"2022-03-16T21:38:37.086602904Z"}
{"log":"I0316 21:38:37.086116       1 service.go:421] Adding new service port \"aws-load-balancer-controller-1/aws-load-balancer-webhook-service\" at fd7e:a2a4:8fcf::cecc:443/TCP\n","stream":"stderr","time":"2022-03-16T21:38:37.086606406Z"}
{"log":"I0316 21:38:37.086252       1 proxier.go:841] \"Stale service\" protocol=\"udp\" svcPortName=\"kube-system/kube-dns:dns\" clusterIP=\"fd7e:a2a4:8fcf::a\"\n","stream":"stderr","time":"2022-03-16T21:38:37.086610165Z"}
{"log":"I0316 21:38:37.086265       1 proxier.go:854] \"Syncing iptables rules\"\n","stream":"stderr","time":"2022-03-16T21:38:37.086613969Z"}
{"log":"I0316 21:38:37.182094       1 proxier.go:824] \"syncProxyRules complete\" elapsed=\"96.067734ms\"\n","stream":"stderr","time":"2022-03-16T21:38:37.182182295Z"}
{"log":"W0316 21:47:40.916347       1 warnings.go:70] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice\n","stream":"stderr","time":"2022-03-16T21:47:40.916471013Z"}

During this time, aws-node was waiting to test reachability to the API server:

{"level":"info","ts":"2022-03-16T21:38:40.376Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T21:38:40.376Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}

Kubelet logs:

Mar 16 21:38:29 ip-192-168-82-160.eu-north-1.compute.internal kubelet[3726]: I0316 21:38:29.035093    3726 kubelet.go:1973] "SyncLoop (PLEG): 
event for pod" pod="kube-system/aws-node-hdd9c" event=&{ID:c818b9b6-8769-419d-b3a6-246a0dac8990 
Type:ContainerDied Data:8fea2f7abdb311abe47e61e495f5e597fdd63e6cdd995925a412ac156230f32c}
Mar 16 21:38:30 ip-192-168-82-160.eu-north-1.compute.internal kubelet[3726]: I0316 21:38:30.064174    3726 kubelet.go:1973] 
"SyncLoop (PLEG): event for pod" pod="kube-system/aws-node-hdd9c" event=&{ID:c818b9b6-8769-419d-b3a6-246a0dac8990 
Type:ContainerStarted Data:8fea2f7abdb311abe47e61e495f5e597fdd63e6cdd995925a412ac156230f32c}

Mar 16 21:38:30 ip-192-168-82-160.eu-north-1.compute.internal kubelet[3726]: I0316 21:38:30.068016    3726 kubelet.go:1973] 
"SyncLoop (PLEG): event for pod" pod="kube-system/kube-proxy-85v8f" 
event=&{ID:31d7987c-237e-4f32-8127-6611f52ad60b Type:ContainerStarted Data:3fe61b01e4c5733efe029cc99532487e4cd9d86ccd2c0629fa928db09116ea32}
Mar 16 21:38:37 ip-192-168-82-160.eu-north-1.compute.internal kubelet[3726]: I0316 21:38:37.224155    3726 kubelet.go:1973] 
"SyncLoop (PLEG): event for pod" 
pod="kube-system/kube-proxy-85v8f" event=&{ID:31d7987c-237e-4f32-8127-6611f52ad60b Type:ContainerStarted Data:e9910a84cb77c18f06e85f4114ae0e9678c50777cbc17f8a13144032d6a912f5}

Mar 16 21:38:38 ip-192-168-82-160.eu-north-1.compute.internal kubelet[3726]: I0316 21:38:38.230639    3726 kubelet.go:1973] 
"SyncLoop (PLEG): event for pod" pod="kube-system/aws-node-hdd9c" event=&{ID:c818b9b6-8769-419d-b3a6-246a0dac8990
 Type:ContainerDied Data:0de0444f063a97baba4ee15cd113b6e6a9e7cce73bd7e5ecd54be4846e20f0c4}

aws-node failed to reach the API server because of this kube-proxy dependency, so kubelet determined the pod was dead and restarted it. This time it succeeds:

{"level":"info","ts":"2022-03-16T21:40:16.932Z","caller":"aws-k8s-agent/main.go:27","msg":"Starting L-IPAMD   ..."}
{"level":"info","ts":"2022-03-16T21:40:16.933Z","caller":"aws-k8s-agent/main.go:42","msg":"Testing communication with server"}
{"level":"info","ts":"2022-03-16T21:40:16.945Z","caller":"aws-k8s-agent/main.go:42","msg":"Successful communication with the Cluster! Cluster Version is: v1.21+. git version: v1.21.5-eks-bc4871b. git tree state: clean. commit: 5236faf39f1b7a7dabea8df12726f25608131aa9. platform: linux/amd64"}

This is expected behavior and we are looking into removing this dependency at startup.
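The entrypoint logs at the top of this issue show the usual bounded retry-then-exit pattern: poll the dependency, retry a limited number of times, and exit non-zero so that kubelet counts a restart. A minimal sketch of that pattern follows; the function and variable names are hypothetical and the check is simulated, this is not the actual entrypoint.sh.

```shell
# Sketch of the retry loop seen in the "Retrying waiting for IPAM-D" logs.
# check_ipamd is a stand-in for the real IPAM-D connectivity check; here it
# simulates a daemon that only becomes ready on the third poll.
ATTEMPTS_SO_FAR=0
check_ipamd() {
  ATTEMPTS_SO_FAR=$((ATTEMPTS_SO_FAR + 1))
  [ "$ATTEMPTS_SO_FAR" -ge 3 ]
}

MAX_RETRIES=5
i=0
until check_ipamd; do
  i=$((i + 1))
  if [ "$i" -ge "$MAX_RETRIES" ]; then
    # Exiting non-zero is what makes kubelet record a container restart.
    echo "Timed out waiting for IPAM-D" >&2
    exit 1
  fi
  echo "Retrying waiting for IPAM-D"
  sleep 0.1
done
echo "IPAM-D is ready after $ATTEMPTS_SO_FAR checks"
```

On a fresh node where kube-proxy has not yet programmed routes to the API server, the in-container retries run out before ipamd becomes reachable, which is why one restart per node shows up so reliably.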

@matti
Author

matti commented Mar 18, 2022

This expected behavior also leaves nodes that have just joined the cluster unable to run workloads for ~1 min, slowing everything down. How can I work around this?

@matti
Author

matti commented Mar 21, 2022

Working around this with faster restarts by dropping the livenessProbe initialDelaySeconds from 60 to 1:

kubectl patch daemonset -n kube-system aws-node --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value":1}]'
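For teams that keep such overrides in version control, the same change can be expressed as a strategic merge patch file and applied with kubectl's `--patch-file` flag (available in recent kubectl versions). The file name and path below are illustrative only.

```shell
# Write the liveness-probe override as a strategic merge patch; the containers
# list is merged by the "name" key, so only aws-node's probe is changed.
cat > /tmp/aws-node-liveness-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: aws-node
          livenessProbe:
            initialDelaySeconds: 1
EOF

# Apply it to the daemonset (needs cluster access, so commented out here):
# kubectl patch daemonset -n kube-system aws-node --patch-file /tmp/aws-node-liveness-patch.yaml
echo "patch file written"
```

This is equivalent to the inline JSON patch above, just easier to review and reuse.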

vijay-veeranki added a commit to ministryofjustice/cloud-platform-terraform-eks-add-ons that referenced this issue Mar 21, 2022
@crechandan

crechandan commented Mar 30, 2022

Hello All,

I am still facing the issue of the aws-cni pod not running on my EKS cluster (via the CloudFormation EKS add-on functionality):

EKS version 1.20.11

  • amazon-k8s-cni:v1.7.5-eksbuild.1
  • amazon-k8s-cni-init:v1.7.5-eksbuild.1

Even after applying the workaround patch, i.e. aws-node-patch.yml:

spec:
  template:
    spec:
      containers:
        - name: aws-node
          livenessProbe:
            timeoutSeconds: 5
          readinessProbe:
            timeoutSeconds: 5

Below are the kube-proxy logs:

kubectl logs kube-proxy-69hdx -n kube-system

E0330 10:53:04.525000       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found
E0330 10:53:05.646490       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found
E0330 10:53:08.049893       1 node.go:161] Failed to retrieve node info: nodes "ip-*" not found
E0330 10:53:12.226920       1 node.go:161] Failed to retrieve node info: nodes "ip-1*" not found

Below are the aws-cni logs:

kubectl logs aws-node-fs26n -n kube-system

{"level":"info","ts":"2022-03-30T11:12:02.081Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2022-03-30T11:12:02.104Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-30T11:12:02.112Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E0330 11:12:02.441497       9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
Under the describe events for the aws-cni pod, I can see failing readiness probes:

Events:

Type     Reason     Age                   From               Message
----     ------     ----                  ----               -------
Warning  Unhealthy  40m                   kubelet            **Readiness probe failed**: {"level":"info","ts":"2022-03-30T10:34:08.320Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  40m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-30T10:34:18.327Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  40m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-30T10:34:28.312Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  40m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-30T10:34:38.334Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  39m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-30T10:34:48.316Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  39m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-30T10:34:58.342Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  39m                   kubelet            Liveness probe failed: {"level":"info","ts":"2022-03-30T10:35:07.083Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  39m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-30T10:35:08.331Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Warning  Unhealthy  39m                   kubelet            Liveness probe failed: {"level":"info","ts":"2022-03-30T10:35:17.087Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
Normal   Killing    39m                   kubelet            Container aws-node failed liveness probe, will be restarted
Normal   Created    39m (x2 over 40m)     kubelet            Created container aws-node
Normal   Pulled     39m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1" in 205.854108ms
Normal   Pulling    39m (x2 over 40m)     kubelet            Pulling image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1"
Normal   Pulled     35m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1" in 198.271634ms
Normal   Pulled     30m                   kubelet            Successfully pulled image "602401143452.dkr.ecr.ap-southeast-1.amazonaws.com/amazon-k8s-cni:v1.7.5-eksbuild.1" in 188.030354ms
Warning  BackOff    5m43s (x74 over 29m)  kubelet            Back-off restarting failed container
Warning  Unhealthy  46s (x162 over 39m)   kubelet            (combined from similar events): Readiness probe failed: {"level":"info","ts":"2022-03-30T11:13:58.318Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}

This is not resolved. Can anyone help with this one?
Thanks,
Chandan

@jayanthvn
Contributor

jayanthvn commented Apr 12, 2022

This PR #1943 should help mitigate this issue to some extent. Previously, aws-node would only retry API server connectivity upon restart, but with this change we retry a few more times before restarting. This is milestoned for the v1.11.0 release.

@matti
Author

matti commented Apr 13, 2022

@jayanthvn what about #1943 (comment)

@jayanthvn
Contributor

@matti - I see the comment is resolved. Will close this issue for now. Please try out v1.11.0 and let us know if the behavior has improved.

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
