Waiting for ipamd health check deadlocks node bootstrapping outside EKS #575

Closed
drakedevel opened this issue Aug 7, 2019 · 13 comments
Labels: bug, needs investigation, priority/P0 (Highest priority. Someone needs to actively work on this.)

@drakedevel (Contributor)

Tested 1.5.2 on Kubernetes 1.15.2

The change in #553 introduced a node bootstrapping problem on our kubeadm test cluster. With this change, nodes stay tainted with node.kubernetes.io/not-ready until ipamD is healthy, and ipamD can't become healthy until it can reach the API server. On at least kubeadm and kops clusters, the API server is reached through a ClusterIP, which requires kube-proxy to be up and running. This circular dependency means that nodes booting up simply sit there forever.
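
For anyone trying to reproduce this, the symptom is easy to confirm with standard kubectl commands (the <node-name> and <aws-node-pod> placeholders are hypothetical, and the k8s-app=aws-node label assumes the stock manifest):

kubectl get nodes                                             # new node stays NotReady
kubectl describe node <node-name> | grep -A1 Taints           # node.kubernetes.io/not-ready taint present
kubectl -n kube-system get pods -o wide -l k8s-app=aws-node   # aws-node pod on that node never becomes Ready
kubectl -n kube-system logs <aws-node-pod>                    # agent output while waiting on ipamd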

mogren added the bug, needs investigation, and priority/P0 labels on Aug 7, 2019
@drakedevel (Contributor, Author) commented Aug 7, 2019

Sorry, this explanation is bogus -- I misread my kubectl output and kube-proxy is running fine. I am still seeing this in ipamd.log (with the result that the node is NotReady):

2019-08-07T02:24:04.835Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-08-07T02:24:04.835Z [ERROR]        Failed to create client: error communicating with apiserver: Get https://100.127.0.1:443/version?timeout=32s: dial tcp 100.127.0.1:443: i/o timeout

The cluster worked fine with 1.5.0 w/ Kubernetes 1.15.0, but I need to investigate more tomorrow. Will update here when I figure out what happened -- sorry for the noise! Feel free to close if you want, I can reopen.

@mogren (Contributor) commented Aug 7, 2019

@drakedevel Ok, thanks for the follow up.

We did test v1.5.2 quite a lot, both on new clusters and upgrading from older versions. I'll close this issue since kube-proxy does start, but feel free to open another issue if you can't figure out why ipamd can't talk to the API server.

mogren closed this as completed Aug 7, 2019
@drakedevel (Contributor, Author) commented Aug 7, 2019

@mogren It looks like the actual issue is that the aws-node pods started up faster than the kube-proxy pods, so the API server was unreachable initially. The aws-k8s-agent process apparently crashes when this happens (the process was gone when I looked), but install-aws.sh doesn't notice and continues to poll forever.

Everything works fine if the pod is manually deleted, but until then the node is broken, since nothing will automatically get the pod out of this state. In 1.5.1, an aws-k8s-agent crash always results in the pod exiting and getting restarted, so if it "wins" the race against kube-proxy it's at least self-healing that way.

Ideas:

  • install-aws.sh could detect the child process crashing before it becomes healthy and bail out
  • install-aws.sh could put a timeout on the initial wait (a rough sketch of these two follows the list)
  • The readiness probe (commented out in the example config) would work as a fallback to prevent a pod from getting stuck forever in this state
  • ipamD could be made to retry instead of crash on startup, although if it were to crash some other way the pod would get stuck the same way (so possibly one of the previous two would be a good idea as well)
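
For the first two ideas, a rough sketch of what the wrapper loop could look like (this is not the actual install-aws.sh; the introspection URL on localhost:61679 and the 5-minute budget are assumptions):

#!/usr/bin/env bash
# Start the agent in the background and remember its PID.
./aws-k8s-agent &
agent_pid=$!

deadline=$((SECONDS + 300))   # hypothetical 5-minute startup budget
# Poll ipamd's introspection endpoint until it answers.
until curl -fs http://localhost:61679/v1/enis > /dev/null; do
    # Idea 1: bail out if the agent died before ever becoming healthy.
    if ! kill -0 "$agent_pid" 2> /dev/null; then
        echo "aws-k8s-agent exited before becoming healthy" >&2
        exit 1
    fi
    # Idea 2: give up after the deadline so kubelet restarts the pod.
    if [ "$SECONDS" -ge "$deadline" ]; then
        echo "timed out waiting for ipamd to become healthy" >&2
        exit 1
    fi
    sleep 2
done
echo "ipamd is healthy, continuing with CNI setup"

Either way the pod exits, kubelet restarts it, and the node isn't left stuck behind a dead agent.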

@mogren (Contributor) commented Aug 7, 2019

@drakedevel You are right about the timeout. I had another approach in this PR: #576

mogren reopened this Aug 7, 2019
@mogren (Contributor) commented Aug 7, 2019

@drakedevel An image with that change is available in my ECR repo, 973117571331.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc1, if you would like to test it in your dev cluster. That image is the latest mainline plus #576. If ipamd fails to come up, it should just exit and restart.
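
For anyone else who wants to try it on an existing test cluster, swapping the image in place should be enough (this assumes the stock aws-node DaemonSet in kube-system, with the container also named aws-node):

kubectl -n kube-system set image daemonset/aws-node aws-node=973117571331.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc1
kubectl -n kube-system rollout status daemonset/aws-node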

@drakedevel (Contributor, Author)

@mogren works like a charm! Rolled out a fresh cluster the exact same way but with the new image, got the same timeout error, and the pod restarted as expected until kube-proxy started up.

2019-08-07T04:24:09.106Z [INFO]	Starting L-IPAMD v1.6.0-rc1  ...
2019-08-07T04:24:39.123Z [INFO]	Testing communication with server
2019-08-07T04:25:09.124Z [INFO]	Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-08-07T04:25:09.124Z [ERROR]	Failed to create client: error communicating with apiserver: Get https://100.127.0.1:443/version?timeout=32s: dial tcp 100.127.0.1:443: i/o timeout
2019-08-07T04:25:10.496Z [INFO]	Starting L-IPAMD v1.6.0-rc1  ...
2019-08-07T04:25:40.498Z [INFO]	Testing communication with server
2019-08-07T04:26:10.076Z [INFO]	Starting L-IPAMD v1.6.0-rc1  ...
2019-08-07T04:26:10.123Z [INFO]	Testing communication with server
2019-08-07T04:26:10.124Z [INFO]	Running with Kubernetes cluster version: v1.15. git version: v1.15.2. git tree state: clean. commit: f6278300bebbb750328ac16ee6dd3aa7d3549568. platform: linux/amd64
2019-08-07T04:26:10.124Z [INFO]	Communication with server successful

@mogren (Contributor) commented Aug 7, 2019

@drakedevel Thanks a lot for verifying!

@drakedevel (Contributor, Author)

No problem at all, thanks for the quick fix! 😄

mogren mentioned this issue Aug 7, 2019
@seancurran157 commented Aug 7, 2019

@mogren I tried using the RC and we are getting the following error on some pods:

aws-node logs

2019-08-07T20:46:10.637Z [DEBUG]	Handle corev1.Node: ip-10-12-175-148.ec2.internal, map[node.alpha.kubernetes.io/ttl:0 volumes.kubernetes.io/controller-managed-attach-detach:true], map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:m5.2xlarge beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-east-1 failure-domain.beta.kubernetes.io/zone:us-east-1c k8s.amazonaws.com/eniConfig:us-east-1c kubernetes.io/hostname:ip-10-12-175-148.ec2.internal]
2019-08-07T20:46:11.392Z [INFO]	Received DelNetwork for IP <nil>, Pod datadog-zpfxm, Namespace cloudplatform-system, Container 53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716
2019-08-07T20:46:11.392Z [DEBUG]	UnassignPodIPv4Address: IP address pool stats: total:28, assigned 1, pod(Name: datadog-zpfxm, Namespace: cloudplatform-system, Container 53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716)
2019-08-07T20:46:11.392Z [WARN]	UnassignPodIPv4Address: Failed to find pod datadog-zpfxm namespace cloudplatform-system Container 53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716
2019-08-07T20:46:11.392Z [DEBUG]	UnassignPodIPv4Address: IP address pool stats: total:28, assigned 1, pod(Name: datadog-zpfxm, Namespace: cloudplatform-system, Container )
2019-08-07T20:46:11.392Z [WARN]	UnassignPodIPv4Address: Failed to find pod datadog-zpfxm namespace cloudplatform-system Container
2019-08-07T20:46:11.392Z [INFO]	Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod

kubectl describe po datadog-zpfxm -n cloudplatform-system

  Warning  FailedCreatePodSandBox  13m                kubelet, ip-10-12-172-34.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716" network for pod "datadog-zpfxm": NetworkPlugin cni failed to set up pod "datadog-zpfxm_cloudplatform-system" network: add cmd: failed to assign an IP address to container, failed to clean up sandbox container "53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716" network for pod "datadog-zpfxm": NetworkPlugin cni failed to teardown pod "datadog-zpfxm_cloudplatform-system" network: del cmd: failed to process delete request]
  Normal   SandboxChanged          3m (x46 over 13m)  kubelet, ip-10-12-172-34.ec2.internal  Pod sandbox changed, it will be killed and re-created.

This error occurs when a new worker is introduced.

@mogren (Contributor) commented Aug 8, 2019

Thanks @seancurran157 for reporting, I'll try to reproduce it ASAP.

@seancurran157

@mogren any luck on reproducing?

@mogren (Contributor) commented Aug 14, 2019

@seancurran157 Sorry, not yet. Got pulled in to work on some other issues. Have you tried with v1.5.3?

@mogren (Contributor) commented Sep 27, 2019

This should have been solved in v1.5.3. Please reopen if this is still an issue.

mogren closed this as completed Sep 27, 2019