aws-node pods do not always start successfully with custom networking #1385

Closed
asheldon opened this issue Feb 18, 2021 · 13 comments

@asheldon (Contributor) commented Feb 18, 2021

What happened:
aws-node pods crash on start intermittently with custom networking enabled.

Attached logs:

{"level":"info","ts":"2021-02-18T19:10:43.050Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-02-18T19:10:43.051Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-02-18T19:10:43.063Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-02-18T19:10:43.064Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
time="2021-02-18T19:10:43Z" level=error msg="failed to initialize service object for operator metrics: OPERATOR_NAME must be set"
{"level":"info","ts":"2021-02-18T19:10:45.077Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2021-02-18T19:10:45.079Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2021-02-18T19:10:45.079Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
asheldon@C02DVAQHMD6R azure % kubectl logs -n kube-system aws-node-28dp8  --previous
{"level":"info","ts":"2021-02-18T19:09:09.252Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-02-18T19:09:09.253Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-02-18T19:09:09.263Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-02-18T19:09:09.264Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E0218 19:09:39.289057      10 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)
time="2021-02-18T19:09:39Z" level=error msg="failed to initialize service object for operator metrics: OPERATOR_NAME must be set"
time="2021-02-18T19:09:39Z" level=error msg="failed to get resource client for (apiVersion:crd.k8s.amazonaws.com/v1alpha1, kind:ENIConfig, ns:): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(crd.k8s.amazonaws.com/v1alpha1, Kind=ENIConfig): the cache has not been filled yet"
panic: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(crd.k8s.amazonaws.com/v1alpha1, Kind=ENIConfig): the cache has not been filled yet
goroutine 55 [running]:
github.com/operator-framework/operator-sdk/pkg/sdk.Watch(0x55b3157e6b36, 0x1e, 0x55b3157c624d, 0x9, 0x0, 0x0, 0x12a05f200, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/operator-framework/operator-sdk@v0.0.7/pkg/sdk/api.go:49 +0x46e
github.com/aws/amazon-vpc-cni-k8s/pkg/eniconfig.(*ENIConfigController).Start(0xc000099ce0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/eniconfig/eniconfig.go:164 +0x196
created by main._main
	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:49 +0x54d

What you expected to happen:
No crashes / restarts
How to reproduce it (as minimally and precisely as possible):

  1. Use custom networking (see the sketch after this list)
  2. Launch instances
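
For context, "custom networking" here is the documented ENIConfig-based setup; a minimal sketch, assuming ENIConfigs are named after availability zones as shown later in this thread (the zone label key varies by cluster version):

# Enable custom networking on the VPC CNI daemonset (documented env vars)
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
# Select each node's ENIConfig by its availability-zone label, so a node in
# us-east-1a uses the ENIConfig named "us-east-1a"
kubectl set env daemonset aws-node -n kube-system ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone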

Anything else we need to know?:
Usually works after one or two retries

Environment:

  • Kubernetes version (use kubectl version): EKS 1.19
  • CNI Version: 1.7.8
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.4.91-41.139.amzn2
asheldon added the bug label Feb 18, 2021
@jayanthvn (Contributor)

Hi @asheldon

We will try to repro and debug this further. Can you please share the ipamd logs, collected by running sudo bash /opt/cni/bin/aws-cni-support.sh? You can email them to varavaj@amazon.com.

Thanks.

@asheldon (Author)

Do you execute this command in the pod or on the node?

@couralex6 (Contributor)

> Do you execute this command in the pod or on the node?

On the node @asheldon

@asheldon (Author) commented Mar 2, 2021

Logs have been sent over.

@jayanthvn (Contributor)

Thanks @asheldon. Will look into it and update you.

@jayanthvn (Contributor)

Hi @asheldon

Can you also please share the output of kubectl describe eniconfig.crd.k8s.amazonaws.com/<eniconfig-name>?

Thank you!

@asheldon (Author) commented Mar 9, 2021

Name:         us-east-1a
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  crd.k8s.amazonaws.com/v1alpha1
Kind:         ENIConfig
Metadata:
  Creation Timestamp:  2021-03-08T21:18:06Z
  Generation:          1
  Managed Fields:
    API Version:  crd.k8s.amazonaws.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:securityGroups:
        f:subnet:
    Manager:         Go-http-client
    Operation:       Update
    Time:            2021-03-08T21:18:06Z
  Resource Version:  1036
  Self Link:         /apis/crd.k8s.amazonaws.com/v1alpha1/eniconfigs/us-east-1a
  UID:               343873b8-deeb-41a4-b440-af3c0e99f458
Spec:
  Security Groups:
    sg-REDACTED
    sg-REDACTED
    
  Subnet:  subnet-REDACTED
Events:    <none>

@jayanthvn (Contributor)

Hi,

The error (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout) means aws-node is unable to reach the API server, which happens if kube-proxy starts later. Can you please check the kube-proxy logs and confirm whether aws-node started before kube-proxy was up?
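
One way to check the ordering (pod IDs and the node name are placeholders):

# List kube-system pods on the affected node to compare start times
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>
# The first timestamp in each log shows when the container actually started
kubectl logs -n kube-system kube-proxy-<id> | head -n 5
kubectl logs -n kube-system aws-node-<id> --previous | head -n 5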

@asheldon (Author)

I see an aws-node startup failure timestamped 00:43:14.340664 for the request

https://172.20.0.1:443/api?timeout=32s

and kube-proxy records

I0313 00:42:43.943131 1 service.go:396] Adding new service port "default/kubernetes:https" at 172.20.0.1:443/TCP

The gap is about 30.4 seconds, which is less than the 32-second timeout in the request. This makes me think that aws-node started the query ~1.6 seconds before kube-proxy was ready for it, hung for ~32 seconds, then crashed.
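
Spelled out from those timestamps:

kube-proxy ready:   00:42:43.943
aws-node failure:   00:43:14.340  (gap: 30.40 s, under the 32 s timeout)
query start ≈ 00:43:14.340 - 32 s = 00:42:42.34, i.e. ~1.6 s before kube-proxy was ready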

@jayanthvn (Contributor)

Hi @asheldon

Thanks for checking, and yes, it seems aws-node came up before kube-proxy. Also, can you please check whether the delay in kube-proxy is because of this:

{"log":"I0904 13:53:07.163201       1 feature_gate.go:243] feature gates: \u0026{map[]}\n","stream":"stderr","time":"2020-09-04T13:53:07.164205369Z"}
{"log":"E0904 13:53:07.657821       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:07.657910856Z"}
{"log":"E0904 13:53:08.817715       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:08.817826081Z"}
{"log":"E0904 13:53:11.028140       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:11.028236757Z"}
{"log":"E0904 13:53:15.789086       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:15.789186713Z"}
{"log":"E0904 13:53:24.954484       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:24.954581764Z"}
{"log":"E0904 13:53:43.712021       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:43.712148589Z"}
{"log":"I0904 13:53:43.712044       1 server_others.go:140] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag\n","stream":"stderr","time":"2020-09-04T13:53:43.712191645Z"}

@asheldon (Author)

There are no delays like that in kube-proxy startup on my node. The first timestamp is I0313 00:42:43.609192, and the I0313 00:42:43.943131 timestamp is less than a second later.

@jayanthvn (Contributor)

This is similar to #1078. Please see the comment here: #1078 (comment)

jayanthvn assigned jayanthvn and unassigned couralex6 Jun 10, 2021
@jayanthvn (Contributor)

kube-proxy expects to find a node matching its hostname during startup and eventually falls back to localhost. During this time, aws-node will crash-loop until kube-proxy starts successfully and becomes available. If the node's hostname doesn't match the Kubernetes node name, the kube-proxy daemonset should be patched as described here: https://gist.github.com/M00nF1sh/84d380b4e08017a5bc958658f7010914 (a sketch follows below). We are also working internally to make this change available by default in kube-proxy.

/cc @M00nF1sh
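
For reference, a minimal sketch of the kind of change the gist describes, assuming the stock EKS kube-proxy daemonset layout (verify the current container command in your cluster before applying):

# Edit the kube-proxy daemonset ...
kubectl -n kube-system edit daemonset kube-proxy
# ... and wire --hostname-override to the Kubernetes node name, roughly:
#   containers:
#   - name: kube-proxy
#     command:
#     - kube-proxy
#     - --v=2
#     - --hostname-override=$(NODE_NAME)
#     - --config=/var/lib/kube-proxy-config/config
#     env:
#     - name: NODE_NAME
#       valueFrom:
#         fieldRef:
#           fieldPath: spec.nodeName

With this, kube-proxy looks itself up by the API node name rather than the instance hostname, so it skips the node-not-found retry loop shown in the earlier logs.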
