aws-node pods do not always start successfully with custom networking #1385

Closed
asheldon opened this issue Feb 18, 2021 · 13 comments

@asheldon (Contributor) commented Feb 18, 2021

What happened:
aws-node pods crash on start intermittently with custom networking enabled.

Attached logs:

{"level":"info","ts":"2021-02-18T19:10:43.050Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-02-18T19:10:43.051Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-02-18T19:10:43.063Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-02-18T19:10:43.064Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
time="2021-02-18T19:10:43Z" level=error msg="failed to initialize service object for operator metrics: OPERATOR_NAME must be set"
{"level":"info","ts":"2021-02-18T19:10:45.077Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2021-02-18T19:10:45.079Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2021-02-18T19:10:45.079Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}
asheldon@C02DVAQHMD6R azure % kubectl logs -n kube-system aws-node-28dp8  --previous
{"level":"info","ts":"2021-02-18T19:09:09.252Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2021-02-18T19:09:09.253Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-02-18T19:09:09.263Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-02-18T19:09:09.264Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E0218 19:09:39.289057      10 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)
time="2021-02-18T19:09:39Z" level=error msg="failed to initialize service object for operator metrics: OPERATOR_NAME must be set"
time="2021-02-18T19:09:39Z" level=error msg="failed to get resource client for (apiVersion:crd.k8s.amazonaws.com/v1alpha1, kind:ENIConfig, ns:): failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(crd.k8s.amazonaws.com/v1alpha1, Kind=ENIConfig): the cache has not been filled yet"
panic: failed to get resource type: failed to get the resource REST mapping for GroupVersionKind(crd.k8s.amazonaws.com/v1alpha1, Kind=ENIConfig): the cache has not been filled yet
goroutine 55 [running]:
github.com/operator-framework/operator-sdk/pkg/sdk.Watch(0x55b3157e6b36, 0x1e, 0x55b3157c624d, 0x9, 0x0, 0x0, 0x12a05f200, 0x0, 0x0, 0x0)
	/go/pkg/mod/github.com/operator-framework/operator-sdk@v0.0.7/pkg/sdk/api.go:49 +0x46e
github.com/aws/amazon-vpc-cni-k8s/pkg/eniconfig.(*ENIConfigController).Start(0xc000099ce0)
	/go/src/github.com/aws/amazon-vpc-cni-k8s/pkg/eniconfig/eniconfig.go:164 +0x196
created by main._main
	/go/src/github.com/aws/amazon-vpc-cni-k8s/cmd/aws-k8s-agent/main.go:49 +0x54d

What you expected to happen:
No crashes / restarts
How to reproduce it (as minimally and precisely as possible):

  1. Use custom networking (see the sketch after this list)
  2. Launch instances
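
For context, "custom networking" here is the documented ENIConfig-based setup; a minimal sketch, assuming ENIConfigs are named after availability zones as shown later in this thread (the zone label key varies by cluster version):

# Enable custom networking on the VPC CNI daemonset (documented env vars)
kubectl set env daemonset aws-node -n kube-system AWS_VPC_K8S_CNI_CUSTOM_NETWORK_CFG=true
# Select each node's ENIConfig by its availability-zone label, so a node in
# us-east-1a uses the ENIConfig named "us-east-1a"
kubectl set env daemonset aws-node -n kube-system ENI_CONFIG_LABEL_DEF=topology.kubernetes.io/zone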

Anything else we need to know?:
Usually works after one or two retries

Environment:

  • Kubernetes version (use kubectl version): EKS 1.19
  • CNI Version: 1.7.8
  • OS (e.g: cat /etc/os-release): Amazon Linux 2
  • Kernel (e.g. uname -a): 5.4.91-41.139.amzn2
asheldon added the bug label Feb 18, 2021
@jayanthvn (Contributor)

Hi @asheldon

We will try to repro and debug this further. Can you please share the ipamd logs, collected by running sudo bash /opt/cni/bin/aws-cni-support.sh? You can email them to varavaj@amazon.com.

Thanks.

@asheldon (Author)

Do you execute this command in the pod or on the node?

@couralex6 (Contributor)

> Do you execute this command in the pod or on the node?

On the node @asheldon

@asheldon (Author) commented Mar 2, 2021

Logs have been sent over.

@jayanthvn (Contributor)

Thanks @asheldon. Will look into it and update you.

@jayanthvn (Contributor)

Hi @asheldon

Can you also please share the output of kubectl describe eniconfig.crd.k8s.amazonaws.com/<eniconfig-name>?

Thank you!

@asheldon (Author) commented Mar 9, 2021

Name:         us-east-1a
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  crd.k8s.amazonaws.com/v1alpha1
Kind:         ENIConfig
Metadata:
  Creation Timestamp:  2021-03-08T21:18:06Z
  Generation:          1
  Managed Fields:
    API Version:  crd.k8s.amazonaws.com/v1alpha1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:securityGroups:
        f:subnet:
    Manager:         Go-http-client
    Operation:       Update
    Time:            2021-03-08T21:18:06Z
  Resource Version:  1036
  Self Link:         /apis/crd.k8s.amazonaws.com/v1alpha1/eniconfigs/us-east-1a
  UID:               343873b8-deeb-41a4-b440-af3c0e99f458
Spec:
  Security Groups:
    sg-REDACTED
    sg-REDACTED
    
  Subnet:  subnet-REDACTED
Events:    <none>

@jayanthvn (Contributor)

Hi,

The error (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout) means aws-node is unable to reach the API server, which happens if kube-proxy starts later. Can you please check the kube-proxy logs and confirm whether aws-node started before kube-proxy was up?
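
One way to check the ordering (pod IDs and the node name are placeholders):

# List kube-system pods on the affected node to compare start times
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>
# The first timestamp in each log shows when the container actually started
kubectl logs -n kube-system kube-proxy-<id> | head -n 5
kubectl logs -n kube-system aws-node-<id> --previous | head -n 5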

@asheldon (Author)

I see an aws-node startup failure timestamped 00:43:14.340664 for the request

https://172.20.0.1:443/api?timeout=32s

and kube-proxy records

I0313 00:42:43.943131 1 service.go:396] Adding new service port "default/kubernetes:https" at 172.20.0.1:443/TCP

The gap is about 30.4 seconds, which is less than the 32-second timeout in the request. This makes me think that aws-node started the query ~1.6 seconds before kube-proxy was ready for it, hung for ~32 seconds, then crashed.
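
Spelled out from those timestamps:

kube-proxy ready:   00:42:43.943
aws-node failure:   00:43:14.340  (gap: 30.40 s, under the 32 s timeout)
query start ≈ 00:43:14.340 - 32 s = 00:42:42.34, i.e. ~1.6 s before kube-proxy was ready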

@jayanthvn (Contributor)

Hi @asheldon

Thanks for checking, and yes, it seems aws-node came up before kube-proxy. Also, can you please check whether the delay in kube-proxy is because of this:

{"log":"I0904 13:53:07.163201       1 feature_gate.go:243] feature gates: \u0026{map[]}\n","stream":"stderr","time":"2020-09-04T13:53:07.164205369Z"}
{"log":"E0904 13:53:07.657821       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:07.657910856Z"}
{"log":"E0904 13:53:08.817715       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:08.817826081Z"}
{"log":"E0904 13:53:11.028140       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:11.028236757Z"}
{"log":"E0904 13:53:15.789086       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:15.789186713Z"}
{"log":"E0904 13:53:24.954484       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:24.954581764Z"}
{"log":"E0904 13:53:43.712021       1 node.go:124] Failed to retrieve node info: nodes \"ip-172-31-215-37.xx.com\" not found\n","stream":"stderr","time":"2020-09-04T13:53:43.712148589Z"}
{"log":"I0904 13:53:43.712044       1 server_others.go:140] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag\n","stream":"stderr","time":"2020-09-04T13:53:43.712191645Z"}

@asheldon (Author)

There are no delays like that in kube-proxy startup on my node. The first timestamp is I0313 00:42:43.609192, and the I0313 00:42:43.943131 timestamp is less than a second later.

@jayanthvn (Contributor)

This is similar to #1078. Please see the comment here: #1078 (comment)

jayanthvn assigned jayanthvn and unassigned couralex6 Jun 10, 2021
@jayanthvn (Contributor)

kube-proxy expects to find a node matching its hostname during startup and eventually falls back to localhost. During this time, aws-node will crash-loop until kube-proxy starts successfully and becomes available. If the node's hostname doesn't match the Kubernetes node name, the kube-proxy daemonset should be patched as described here: https://gist.github.com/M00nF1sh/84d380b4e08017a5bc958658f7010914 (a sketch follows below). We are also working internally to make this change available by default in kube-proxy.

/cc @M00nF1sh
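
For reference, a minimal sketch of the kind of change the gist describes, assuming the stock EKS kube-proxy daemonset layout (verify the current container command in your cluster before applying):

# Edit the kube-proxy daemonset ...
kubectl -n kube-system edit daemonset kube-proxy
# ... and wire --hostname-override to the Kubernetes node name, roughly:
#   containers:
#   - name: kube-proxy
#     command:
#     - kube-proxy
#     - --v=2
#     - --hostname-override=$(NODE_NAME)
#     - --config=/var/lib/kube-proxy-config/config
#     env:
#     - name: NODE_NAME
#       valueFrom:
#         fieldRef:
#           fieldPath: spec.nodeName

With this, kube-proxy looks itself up by the API node name rather than the instance hostname, so it skips the node-not-found retry loop shown in the earlier logs.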
