Waiting for ipamd health check deadlocks node bootstrapping outside EKS #575

Closed
drakedevel opened this issue Aug 7, 2019 · 13 comments
Labels: bug, needs investigation, priority/P0 (Highest priority. Someone needs to actively work on this.)

@drakedevel (Contributor)

Tested 1.5.2 on Kubernetes 1.15.2

The change in #553 introduced a node bootstrapping problem on our kubeadm test cluster. With this change, nodes stay tainted with node.kubernetes.io/not-ready until ipamD is healthy, and ipamD can't become healthy until it can reach the API server. On at least kubeadm and kops clusters, the API server is reached through a ClusterIP, which requires kube-proxy to be up and running. This circular dependency means that nodes booting up simply sit there forever.
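
For anyone trying to reproduce this, the symptom is easy to confirm with standard kubectl commands (the <node-name> and <aws-node-pod> placeholders are hypothetical, and the k8s-app=aws-node label assumes the stock manifest):

kubectl get nodes                                             # new node stays NotReady
kubectl describe node <node-name> | grep -A1 Taints           # node.kubernetes.io/not-ready taint present
kubectl -n kube-system get pods -o wide -l k8s-app=aws-node   # aws-node pod on that node never becomes Ready
kubectl -n kube-system logs <aws-node-pod>                    # agent output while waiting on ipamd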

mogren added the bug, needs investigation, and priority/P0 labels on Aug 7, 2019
@drakedevel (Contributor, Author) commented Aug 7, 2019

Sorry, this explanation is bogus -- I misread my kubectl output and kube-proxy is running fine. I am still seeing this in ipamd.log (with the result that the node is NotReady):

2019-08-07T02:24:04.835Z [INFO] Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-08-07T02:24:04.835Z [ERROR]        Failed to create client: error communicating with apiserver: Get https://100.127.0.1:443/version?timeout=32s: dial tcp 100.127.0.1:443: i/o timeout

The cluster worked fine with 1.5.0 w/ Kubernetes 1.15.0, but I need to investigate more tomorrow. Will update here when I figure out what happened -- sorry for the noise! Feel free to close if you want, I can reopen.

@mogren (Contributor) commented Aug 7, 2019

@drakedevel Ok, thanks for the follow up.

We did test v1.5.2 quite a lot, both on new clusters and upgrading from older versions. I'll close this issue since kube-proxy does start, but feel free to open another issue if you can't figure out why ipamd can't talk to the API server.

mogren closed this as completed Aug 7, 2019
@drakedevel (Contributor, Author) commented Aug 7, 2019

@mogren It looks like the actual issue is that the aws-node pods started up faster than the kube-proxy pods, so the API server was unreachable initially. The aws-k8s-agent process apparently crashes when this happens (the process was gone when I looked), but install-aws.sh doesn't notice and continues to poll forever.

Everything works fine if the pod is manually deleted, but until then the node is broken, since nothing will automatically get the pod out of this state. In 1.5.1, an aws-k8s-agent crash always results in the pod exiting and getting restarted, so if it "wins" the race against kube-proxy it's at least self-healing that way.

Ideas:

  • install-aws.sh could detect the child process crashing before it becomes healthy and bail out
  • install-aws.sh could put a timeout on the initial wait (a rough sketch of these two follows the list)
  • The readiness probe (commented out in the example config) would work as a fallback to prevent a pod from getting stuck forever in this state
  • ipamD could be made to retry instead of crash on startup, although if it were to crash some other way the pod would get stuck the same way (so possibly one of the previous two would be a good idea as well)
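
For the first two ideas, a rough sketch of what the wrapper loop could look like (this is not the actual install-aws.sh; the introspection URL on localhost:61679 and the 5-minute budget are assumptions):

#!/usr/bin/env bash
# Start the agent in the background and remember its PID.
./aws-k8s-agent &
agent_pid=$!

deadline=$((SECONDS + 300))   # hypothetical 5-minute startup budget
# Poll ipamd's introspection endpoint until it answers.
until curl -fs http://localhost:61679/v1/enis > /dev/null; do
    # Idea 1: bail out if the agent died before ever becoming healthy.
    if ! kill -0 "$agent_pid" 2> /dev/null; then
        echo "aws-k8s-agent exited before becoming healthy" >&2
        exit 1
    fi
    # Idea 2: give up after the deadline so kubelet restarts the pod.
    if [ "$SECONDS" -ge "$deadline" ]; then
        echo "timed out waiting for ipamd to become healthy" >&2
        exit 1
    fi
    sleep 2
done
echo "ipamd is healthy, continuing with CNI setup"

Either way the pod exits, kubelet restarts it, and the node isn't left stuck behind a dead agent.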

@mogren (Contributor) commented Aug 7, 2019

@drakedevel You are right about the timeout. I had another approach in this PR: #576

mogren reopened this Aug 7, 2019
@mogren (Contributor) commented Aug 7, 2019

@drakedevel An image with that change is available in my ECR repo, 973117571331.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc1, if you would like to test it in your dev cluster. That image is the latest mainline plus #576. If ipamd fails to come up, it should just exit and restart.
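
For anyone else who wants to try it on an existing test cluster, swapping the image in place should be enough (this assumes the stock aws-node DaemonSet in kube-system, with the container also named aws-node):

kubectl -n kube-system set image daemonset/aws-node aws-node=973117571331.dkr.ecr.us-west-2.amazonaws.com/amazon-k8s-cni:v1.6.0-rc1
kubectl -n kube-system rollout status daemonset/aws-node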

@drakedevel (Contributor, Author)

@mogren works like a charm! Rolled out a fresh cluster the exact same way but with the new image, got the same timeout error, and the pod restarted as expected until kube-proxy started up.

2019-08-07T04:24:09.106Z [INFO]	Starting L-IPAMD v1.6.0-rc1  ...
2019-08-07T04:24:39.123Z [INFO]	Testing communication with server
2019-08-07T04:25:09.124Z [INFO]	Failed to communicate with K8S Server. Please check instance security groups or http proxy setting
2019-08-07T04:25:09.124Z [ERROR]	Failed to create client: error communicating with apiserver: Get https://100.127.0.1:443/version?timeout=32s: dial tcp 100.127.0.1:443: i/o timeout
2019-08-07T04:25:10.496Z [INFO]	Starting L-IPAMD v1.6.0-rc1  ...
2019-08-07T04:25:40.498Z [INFO]	Testing communication with server
2019-08-07T04:26:10.076Z [INFO]	Starting L-IPAMD v1.6.0-rc1  ...
2019-08-07T04:26:10.123Z [INFO]	Testing communication with server
2019-08-07T04:26:10.124Z [INFO]	Running with Kubernetes cluster version: v1.15. git version: v1.15.2. git tree state: clean. commit: f6278300bebbb750328ac16ee6dd3aa7d3549568. platform: linux/amd64
2019-08-07T04:26:10.124Z [INFO]	Communication with server successful

@mogren (Contributor) commented Aug 7, 2019

@drakedevel Thanks a lot for verifying!

@drakedevel (Contributor, Author)

No problem at all, thanks for the quick fix! 😄

mogren mentioned this issue Aug 7, 2019
@seancurran157 commented Aug 7, 2019

@mogren I tried using the RC and we are getting the following error on some pods:

aws-node logs

2019-08-07T20:46:10.637Z [DEBUG]	Handle corev1.Node: ip-10-12-175-148.ec2.internal, map[node.alpha.kubernetes.io/ttl:0 volumes.kubernetes.io/controller-managed-attach-detach:true], map[beta.kubernetes.io/arch:amd64 beta.kubernetes.io/instance-type:m5.2xlarge beta.kubernetes.io/os:linux failure-domain.beta.kubernetes.io/region:us-east-1 failure-domain.beta.kubernetes.io/zone:us-east-1c k8s.amazonaws.com/eniConfig:us-east-1c kubernetes.io/hostname:ip-10-12-175-148.ec2.internal]
2019-08-07T20:46:11.392Z [INFO]	Received DelNetwork for IP <nil>, Pod datadog-zpfxm, Namespace cloudplatform-system, Container 53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716
2019-08-07T20:46:11.392Z [DEBUG]	UnassignPodIPv4Address: IP address pool stats: total:28, assigned 1, pod(Name: datadog-zpfxm, Namespace: cloudplatform-system, Container 53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716)
2019-08-07T20:46:11.392Z [WARN]	UnassignPodIPv4Address: Failed to find pod datadog-zpfxm namespace cloudplatform-system Container 53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716
2019-08-07T20:46:11.392Z [DEBUG]	UnassignPodIPv4Address: IP address pool stats: total:28, assigned 1, pod(Name: datadog-zpfxm, Namespace: cloudplatform-system, Container )
2019-08-07T20:46:11.392Z [WARN]	UnassignPodIPv4Address: Failed to find pod datadog-zpfxm namespace cloudplatform-system Container
2019-08-07T20:46:11.392Z [INFO]	Send DelNetworkReply: IPv4Addr , DeviceNumber: 0, err: datastore: unknown pod

kubectl describe po datadog-zpfxm -n cloudplatform-system

  Warning  FailedCreatePodSandBox  13m                kubelet, ip-10-12-172-34.ec2.internal  Failed create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716" network for pod "datadog-zpfxm": NetworkPlugin cni failed to set up pod "datadog-zpfxm_cloudplatform-system" network: add cmd: failed to assign an IP address to container, failed to clean up sandbox container "53a548c383703bc4af87893dd24e8f030cb04ad78808a09f07e6eb4e795fe716" network for pod "datadog-zpfxm": NetworkPlugin cni failed to teardown pod "datadog-zpfxm_cloudplatform-system" network: del cmd: failed to process delete request]
  Normal   SandboxChanged          3m (x46 over 13m)  kubelet, ip-10-12-172-34.ec2.internal  Pod sandbox changed, it will be killed and re-created.

This error occurs when a new worker is introduced.

@mogren (Contributor) commented Aug 8, 2019

Thanks @seancurran157 for reporting, I'll try to reproduce it ASAP.

@seancurran157

@mogren any luck on reproducing?

@mogren (Contributor) commented Aug 14, 2019

@seancurran157 Sorry, not yet. Got pulled in to work on some other issues. Have you tried with v1.5.3?

@mogren (Contributor) commented Sep 27, 2019

This should have been solved in v1.5.3. Please reopen if this is still an issue.

mogren closed this as completed Sep 27, 2019