EKS 1.16 / v1.6.x: "couldn't get current server API group list; will keep using cached value" #1078

Closed
max-rocket-internet opened this issue Jul 9, 2020 · 68 comments

@max-rocket-internet
Contributor

We sometimes see the aws-node pods crash on startup with this logged:

Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0708 16:29:03.884330       6 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://172.20.0.1:443/api?timeout=32s: dial tcp 172.20.0.1:443: i/o timeout)
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

After the initial crash, the pod is restarted and then runs fine. About half of the aws-node pods do this.
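For reference, a quick way to see how widespread the restarts are (assuming the standard manifest, i.e. the kube-system namespace and the k8s-app=aws-node label):

kubectl get pods -n kube-system -l k8s-app=aws-node \
    -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount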

@max-rocket-internet
Contributor Author

There are quite a few similar issues, but they don't show exactly the same error. Not sure if this is a duplicate issue or not...

#1049 (comment) (logs don't match)
#1055 (perpetual problem)
#1054 (different Reason for last state of pod)

@max-rocket-internet
Contributor Author

Same for v1.6.3 😐

@max-rocket-internet max-rocket-internet changed the title EKS 1.16 / v1.6.1: "couldn't get current server API group list; will keep using cached value" EKS 1.16 / v1.6.x: "couldn't get current server API group list; will keep using cached value" Jul 15, 2020
@mogren
Contributor

mogren commented Jul 15, 2020

@max-rocket-internet Thanks for testing with v1.6.3 as well. I suspect the issue here is that kube-proxy has not yet set up the iptables rules to reach 172.20.0.1, and the old client code we still use (see #522 for details) doesn't handle this well.

I guess a workaround to avoid the restarts would be to retry a few times before returning an error, instead of letting kubelet do the retries. That would at least keep these errors out of the logs.
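Roughly, the retry idea would look something like this (just a sketch, not the actual aws-node entrypoint, and it assumes a shell with curl available in the image):

# Sketch only: keep probing the in-cluster API endpoint instead of failing on the first timeout,
# giving kube-proxy time to program the iptables rules for the service VIP.
attempts=0
until curl -sk --max-time 5 "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/healthz" > /dev/null; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 10 ]; then
    echo "API server still unreachable after ${attempts} attempts" >&2
    exit 1
  fi
  echo "Waiting for the service VIP to become reachable..."
  sleep 3
done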

@max-rocket-internet
Contributor Author

Makes sense but why wasn't this a problem with v1.5.x releases? Or is it related to moving to EKS/AMI 1.16?

@hristov-hs

hristov-hs commented Jul 24, 2020

We had the same issue after upgrading from EKS 1.15 to 1.16. We had only been bumping the image version inside the DaemonSet to 1.6.x. What solved our issue was applying the full YAML provided in the AWS docs:
https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.6/config/v1.6/aws-k8s-cni.yaml
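For reference, applying it is just the following (the image registry in that manifest is region-specific, so double-check it matches your cluster's region first):

kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/release-1.6/config/v1.6/aws-k8s-cni.yaml
# then verify what is actually deployed:
kubectl -n kube-system describe daemonset aws-node | grep Image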

It made changes to both the DaemonSet and the ClusterRole.

Good luck!

@max-rocket-internet
Contributor Author

But there are no changes to that file between the v1.6.2 and v1.6.3 releases? https://github.com/aws/amazon-vpc-cni-k8s/commits/master/config/v1.6/aws-k8s-cni.yaml

@max-rocket-internet
Contributor Author

Any update @mogren?

@mogren
Contributor

mogren commented Aug 4, 2020

@max-rocket-internet Hey, sorry for the lack of updates on this. Been out for a bit without much network access, so haven't been able to track this one down. I agree that there is no config change between v1.6.2 and v1.6.3, but since v1.5.x, we have updated the readiness and liveness probe configs.

Between Kubernetes 1.15 and 1.16, kube-proxy changed, so that could be related. We have not been able to reproduce this yet when doing master upgrades.

@schmitz-chris

We had the same problem when updating the master and node groups from 1.15 to 1.16. We had to pin the version of kube-proxy
(kube-proxy:v1.16.13 -> kube-proxy:v1.16.12) and recreate the nodes.

@marcelbirkner

marcelbirkner commented Aug 7, 2020

^^

kubectl set image daemonset.apps/kube-proxy \
    -n kube-system \
    kube-proxy=602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.16.12

@Niksko

Niksko commented Sep 1, 2020

Just an FYI, we were encountering this issue on k8s 1.17, kube-proxy 1.16.13, and AWS CNI 1.6.3 and 1.7.1. It turns out the issue was a bad PSP for kube-proxy that had readOnlyRootFilesystem: true. The kube-proxy logs will show that it's unable to configure some iptables rules due to the read-only root fs; however, this doesn't seem to crash kube-proxy. Setting readOnlyRootFilesystem: false fixes things.
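If you manage the PSP yourself, flipping that field is enough; something like the following (the PSP name here is made up, substitute whatever PSP your kube-proxy pods are actually admitted under, which you can see in the pod's kubernetes.io/psp annotation):

# hypothetical PSP name, adjust to your cluster
kubectl patch podsecuritypolicy kube-proxy-psp --type=json \
    -p='[{"op": "replace", "path": "/spec/readOnlyRootFilesystem", "value": false}]'
# restart kube-proxy so the pods are re-admitted with a writable root fs
kubectl -n kube-system rollout restart daemonset kube-proxy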

@max-rocket-internet
Contributor Author

So two suggestions now:

  1. kube-proxy should have readOnlyRootFilesystem: false
  2. Downgrade kube-proxy to kube-proxy:v1.16.12

Any confirmation from AWS about the issue and resolution?

@mogren
Contributor

mogren commented Sep 2, 2020

@max-rocket-internet @Niksko Could you please provide the kube-proxy configs that you have where this issue shows up? Is this on EKS clusters? What Kubernetes version? Do you have some custom PSP for these clusters? Are the worker nodes using a custom AMI, or do they have some script locking iptables on startup?

Is there something you know of that changed in 1.16.13 that makes kube-proxy take slightly longer to start, triggering this issue?

@Niksko

Niksko commented Sep 3, 2020

@mogren this is on EKS, version 1.17. We discovered this as part of adding custom PSPs to all components. No scripts locking iptables on startup, using the standard EKS AMIs.

The behaviour we were seeing was that the aws-node pod never became ready, and was crash-looping. Apologies if that caused any confusion. I think it's not unreasonable to conclude that:

  • kube-proxy sets up iptables rules that are required by aws-node
  • setting the filesystem to readonly on kube-proxy causes these rules to never be set up, so aws-node crash loops
  • a race condition between kube-proxy and aws-node could cause aws-node to come up before the iptables rules have been configured, causing an initial crash before working as normal (when kube-proxy creates the rules); a rough way to check this on a node is sketched below.
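The check (assuming shell access to the node and the default iptables proxy mode) is to look for the kubernetes Service cluster IP in kube-proxy's KUBE-SERVICES chain:

# the cluster IP is 172.20.0.1 or 10.100.0.1 on EKS, depending on the VPC CIDR
sudo iptables -t nat -L KUBE-SERVICES -n | grep -E '172\.20\.0\.1|10\.100\.0\.1'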

@tibin-mfl

I am facing a similar issue: the aws-node pod restarts once on every node startup, and it works fine after that.
Error:

kubectl logs aws-node-f8tw6   --previous -n kube-system

Copying portmap binary ... Starting IPAM daemon in the background ... ok.
ERROR: logging before flag.Parse: E0904 13:53:37.150548       8 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://10.100.0.1:443/api?timeout=32s: dial tcp 10.100.0.1:443: i/o timeout)
Checking for IPAM connectivity ...  failed.
Timed out waiting for IPAM daemon to start:

EKS Version: 1.17
Platform version: eks.2
Kube-proxy: v1.17.9-eksbuild.1
aws-node: v1.6.3-eksbuild.1

I tried adding a sleep in aws-node to rule out that this is happening because kube-proxy is taking time to start, and verified that kube-proxy started before aws-node.

@jayanthvn
Contributor

Hi @tibin-mfl

Can you please share the CNI logs (https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni) and also the kube-proxy pod logs? That will help us verify why kube-proxy has a delayed start.

Thanks.

@mogopz

mogopz commented Sep 6, 2020

Hi, just wanted to chime in that we're seeing the same thing. Like others have mentioned, the pod seems to restart once when the node first starts up and it's fine after that. We're not using any custom PSPs.

EKS version: 1.17
AMI version: v1.17.9-eks-4c6976
kube-proxy version: 1.17.7
CNI version: 1.6.3

I can see these errors in kube-proxy logs on one of the nodes where aws-node restarted:

udpIdleTimeout: 250ms: v1alpha1.KubeProxyConfiguration.Conntrack: v1alpha1.KubeProxyConntrackConfiguration.ReadObject: found unknown field: max, error found in #10 byte of ...|ck":{"max":0,"maxPer|..., bigger context ...|":"","configSyncPeriod":"15m0s","conntrack":{"max":0,"maxPerCore":32768,"min":131072,"tcpCloseWaitTi|...
I0905 23:12:39.826265       7 feature_gate.go:243] feature gates: &{map[]}
E0905 23:12:40.388938       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:41.516857       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:43.567271       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:48.167166       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:12:56.325941       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
E0905 23:13:14.684106       7 node.go:124] Failed to retrieve node info: nodes "ip-10-0-212-179" not found
I0905 23:13:14.684134       7 server_others.go:140] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag
I0905 23:13:14.684150       7 server_others.go:145] Using iptables Proxier.
W0905 23:13:14.684259       7 proxier.go:286] clusterCIDR not specified, unable to distinguish between internal and external traffic
I0905 23:13:14.684410       7 server.go:571] Version: v1.17.7
I0905 23:13:14.684773       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I0905 23:13:14.684803       7 conntrack.go:52] Setting nf_conntrack_max to 131072
I0905 23:13:14.684850       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I0905 23:13:14.684894       7 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I0905 23:13:14.685092       7 config.go:313] Starting service config controller
I0905 23:13:14.685101       7 shared_informer.go:197] Waiting for caches to sync for service config
I0905 23:13:14.685139       7 config.go:131] Starting endpoints config controller
I0905 23:13:14.685149       7 shared_informer.go:197] Waiting for caches to sync for endpoints config
I0905 23:13:14.785879       7 shared_informer.go:204] Caches are synced for service config
I0905 23:13:14.785932       7 shared_informer.go:204] Caches are synced for endpoints config

And this is in the aws-node logs:

{"log":"Copying portmap binary ... Starting IPAM daemon in the background ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:26.418457689Z"}
{"log":"Checking for IPAM connectivity ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:46.458122639Z"}
{"log":"Copying additional CNI plugin binaries and config files ... ok.\n","stream":"stdout","time":"2020-09-03T11:06:46.474182395Z"}
{"log":"Foregrounding IPAM daemon ... \n","stream":"stdout","time":"2020-09-03T11:06:46.474202946Z"}
{"log":"ERROR: logging before flag.Parse: W0903 14:22:54.564615       9 reflector.go:341] pkg/mod/k8s.io/client-go@v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109452564 (109453466)\n","stream":"stderr","time":"2020-09-03T14:22:54.564769814Z"}
{"log":"ERROR: logging before flag.Parse: W0903 18:30:26.713005       9 reflector.go:341] pkg/mod/k8s.io/client-go@v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109555468 (109679596)\n","stream":"stderr","time":"2020-09-03T18:30:26.713161405Z"}
{"log":"ERROR: logging before flag.Parse: W0903 18:45:56.655601       9 reflector.go:341] pkg/mod/k8s.io/client-go@v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version: 109679596 (109687399)\n","stream":"stderr","time":"2020-09-03T18:45:56.655715674Z"}

It seems like this started happening for us as part of the 1.17 upgrade. We haven't restarted all our nodes since the upgrade, and I can see that on the nodes still running the old AMI (v1.16.12-eks-904af05) the aws-node pod didn't restart:

aws-node-26tfq                               1/1     Running   1          10h
aws-node-2pnwq                               1/1     Running   1          3h33m
aws-node-4f52v                               1/1     Running   1          4d22h
aws-node-5qsll                               1/1     Running   1          5d22h
aws-node-6z6wq                               1/1     Running   0          40d
aws-node-92hvs                               1/1     Running   0          40d
aws-node-c8srx                               1/1     Running   1          5d22h
aws-node-chkhb                               1/1     Running   1          5d4h
aws-node-djlkb                               1/1     Running   0          40d
aws-node-g7drp                               1/1     Running   1          5d5h
aws-node-g9rgn                               1/1     Running   0          40d
aws-node-gbdq5                               1/1     Running   1          2d22h
aws-node-gc5zl                               1/1     Running   1          2d22h
aws-node-hc48d                               1/1     Running   1          5d22h
aws-node-hx9bl                               1/1     Running   1          24d
aws-node-j9dcn                               1/1     Running   1          39d
aws-node-jj4qs                               1/1     Running   1          2d22h
aws-node-kwbjl                               1/1     Running   1          153m
aws-node-ljcv8                               1/1     Running   1          39d
aws-node-lv74f                               1/1     Running   1          12d
aws-node-q2w2w                               1/1     Running   1          2d22h
aws-node-s7qw4                               1/1     Running   1          2d22h
aws-node-tck8w                               1/1     Running   1          5d4h
aws-node-tjhtf                               1/1     Running   1          2d22h
aws-node-tzpb2                               1/1     Running   0          40d
aws-node-vm4nh                               1/1     Running   1          2d22h
aws-node-xnnj2                               1/1     Running   2          153m
aws-node-zchs9                               1/1     Running   1          2d22h

I'm happy to share the full logs if they're helpful, just give me an email address to send them!

@jayanthvn
Contributor

Hi @mogggggg

Thanks for letting us know. Please share the full logs from the log collector script (https://docs.aws.amazon.com/eks/latest/userguide/troubleshooting.html#troubleshoot-cni) and also the kube-proxy pod logs. You can email them to varavaj@amazon.com.

Thanks.

@jayanthvn
Contributor

Hi @mogggggg

Thanks for sharing the logs. Will review and get back on that asap.

@max-rocket-internet
Contributor Author

Could you please provide the kube-proxy configs that you have where this issue shows up?

It's default EKS

Is this on EKS clusters?

Yes

What Kubernetes version?

v1.17.9-eks-4c6976

Do you have some custom PSP for these clusters?

Nope

Are the worker nodes using a custom AMI, or do they have some script locking iptables on startup?

The AMI is v20200723 and there are no custom scripts, except for adding some users in user-data.

@tibin-mfl

I am facing a similar issue: the aws-node pod restarts once on every node startup, and it works fine after that.
EKS Version: 1.17
Platform version: eks.2
Kube-proxy: v1.17.9-eksbuild.1
aws-node: v1.6.3-eksbuild.1

@jayanthvn Just wanted to know: does the aws-node pod depend on kube-proxy, or vice versa?

@mogren
Contributor

mogren commented Sep 8, 2020

@tibin-mfl Yes, the CNI pod (aws-node) needs kube-proxy to set up the cluster IPs before it can start up.

@jayanthvn
Contributor

Hi @mogggggg

Sorry for the delayed response. As you have mentioned, it looks like kube-proxy is waiting to retrieve the node info, and during that time frame aws-node starts and is unable to communicate with the API server because iptables isn't updated yet, hence the restart. I will try to repro this and we will see how to mitigate the issue.

Thanks for your patience.

@focaaby
Contributor

focaaby commented Sep 21, 2020

Currently, the workaround is to add a busybox init container that waits for kube-proxy to start:

  initContainers:
  - name: init-kubernetes-api
    image: busybox:1.28
    command: ['sh', '-c', "until nslookup kubernetes.default.svc.cluster.local ${KUBE_DNS_PORT_53_TCP_ADDR}; do echo waiting for kubernetes Service endpoint; sleep 2; done"]
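One way to add this without editing the DaemonSet by hand (just a suggestion; note that a managed aws-node add-on may reconcile manual changes away):

kubectl -n kube-system patch daemonset aws-node --type=strategic -p '
spec:
  template:
    spec:
      initContainers:
      - name: init-kubernetes-api
        image: busybox:1.28
        command: ["sh", "-c", "until nslookup kubernetes.default.svc.cluster.local ${KUBE_DNS_PORT_53_TCP_ADDR}; do echo waiting for kubernetes Service endpoint; sleep 2; done"]
'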

@tibin-mfl

tibin-mfl commented Sep 25, 2020

I added the initContainer to my aws-node as a temporary fix; the same was suggested by AWS support as well. The problem now is that sometimes aws-node takes more than 5 minutes to come up.

sum by(daemonset, namespace) (kube_daemonset_status_number_unavailable{job="kube-state-metrics",namespace=~"kube-system"}) >0

[Screenshot: graph of unavailable aws-node DaemonSet pods over time]

@mogren
Contributor

mogren commented Sep 25, 2020

@tibin-mfl thanks for reporting this, that is definitely concerning. Do you have kube-proxy logs from any of these nodes? It would be very interesting to see why kube-proxy was taking that long to start up!

@snstanton

Any news on that internal ticket? We keep running into this issue whenever our nodes start.

@jayanthvn
Contributor

Hi,

Sorry for the delayed response. Current findings: NodeIP was added upstream to determine the IP address family - kubernetes/kubernetes#91725

The hostname is picked here - https://github.com/kubernetes/kubernetes/blob/c88d9bed17bba40da02772a8dccb107f5222efc4/cmd/kube-proxy/app/server_others.go#L129 - and is obtained from os.Hostname [https://github.com/kubernetes/kubernetes/blob/c88d9bed17bba40da02772a8dccb107f5222efc4/pkg/util/node/node.go#L56]

[ec2-user@ip-192-168-3-233 ~]$ cat /proc/sys/kernel/hostname
ip-192-168-3-233.us-west-2.compute.internal
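Since kube-proxy looks the node up by that hostname, a quick sanity check (just a suggested diagnostic) is to compare the value above against the actual Node object names:

kubectl get nodes -o name
# if the OS hostname does not match one of these, kube-proxy keeps logging "Failed to retrieve node info"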

One option is to set the bind address ("--bind-address") to 127.0.0.1; this determines the address family as IPv4, and kube-proxy won't wait to get the node IP, but that would break IPv6.

We are exploring other options and will update more soon.

@kenju

kenju commented Apr 23, 2021

@jayanthvn Thanks for working on this. Let me share another case.

The versions are almost the same as #1078 (comment) (kube-proxy:v1.18.8, amazon-k8s-cni-init:v1.7.10), but our nodes are running on Bottlerocket. Not sure whether it is related, but we run other clusters with the same versions at the same time, and we encounter this issue more often on the Bottlerocket nodes.

@schahal

schahal commented Apr 27, 2021

Another tidbit: We ran into this very issue when upgrading from v1.15 to v1.16

Our current workaround: we're keeping kube-proxy at v1.15.11 even after upgrading the rest of the cluster to v1.16.

We were able to get the rest of the add-ons to the recommended versions:

[Screenshot: EKS console showing the add-on versions]

@Minutis

Minutis commented Apr 28, 2021

I think there is a workaround for this issue:
kubernetes/kubernetes#61486 (comment)

At least for me, after applying this I no longer see the Failed to retrieve node info messages (#1078 (comment)).

@s33dunda

I've recently upgraded from 1.19 -> 1.20 and am seeing these now; however, the aws-node pods are never able to connect.
kube-proxy did have some errors that I cleared up using --hostname-override=$(NODE_NAME).
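For reference, this is roughly how I wired up the override (the pattern from the upstream issue; the exact container/args layout depends on your kube-proxy add-on version, and the managed kube-proxy add-on may revert manual edits):

kubectl -n kube-system edit daemonset kube-proxy
# under the kube-proxy container, expose the node name via the Downward API:
#   env:
#   - name: NODE_NAME
#     valueFrom:
#       fieldRef:
#         fieldPath: spec.nodeName
# and append to the container's command/args:
#   --hostname-override=$(NODE_NAME)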
Still no dice for aws-node even after kube-proxy is up and ready. Has anyone else experienced this?

@wtchangdm

wtchangdm commented Jul 10, 2021

@Minutis

I think there is a workaround for this issue:
kubernetes/kubernetes#61486 (comment)

I guess this wouldn't work with EKS's managed add-on? We ran into an issue where our CoreDNS ConfigMap was being overwritten constantly. I asked support and was told the managed add-on reconciles every 15 minutes. Since kube-proxy is another managed add-on, it looks like we will have to wait until aws/containers-roadmap#1415 is done
(or remove it from the managed add-ons and manually apply the manifest).

@wtchangdm

For the hostname issue, I modified self-managed workers' launch template userdata with a dirty hack:

# Adjust according to your exact region
hostname "$(hostname).ap-northeast-1.compute.internal"

then

# Omitted other modifications...
/etc/eks/bootstrap.sh <cluster name>

kube-proxy stopped complaining about Failed to retrieve node info: nodes "ip-10-1-2-3" not found, and the CNI no longer crashes on first start due to the probe failure with the error the OP reported.

Before this change, a worker node waiting for the CNI to become ready would be stuck in the NotReady state for about 2 minutes. Adding this reduces that to less than 20 seconds.

If you are using the managed kube-proxy add-on and don't want to change the deployment on existing clusters (since there will likely be downtime), this seems feasible.

@mhulscher

We sometimes run into a race condition where aws-node is started before kube-proxy. Without kube-proxy, kubernetes.default.svc.cluster.local is not available, so aws-node fails to start and the container is not automatically restarted. To mitigate this, we added the following initContainer to aws-node:

      "initContainers":
        - "name": "wait-for-kubernetes-api"
          "image": "curlimages/curl:7.77.0"
          "command":
            - "sh"
            - "-c"
            - |
              while ! timeout 2s curl --fail --silent --cacert /run/secrets/kubernetes.io/serviceaccount/ca.crt "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/healthz"; do
                echo Waiting on Kubernetes API to respond...
                sleep 1
              done
          "securityContext":
            "runAsUser": 100
            "runAsGroup": 101

@Minutis

Minutis commented Feb 14, 2022

Facing the same issue with the following 1.18 EKS components. CNI version:

user@user-work-laptop:~$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1

Kube Proxy version:

user@user-work-laptop:~$ kubectl describe daemonset kube-proxy --namespace kube-system | grep Image | cut -d "/" -f 3
kube-proxy:v1.18.8-eksbuild.1

Nodes:

user@user-work-laptop:~$ kubectl get nodes
NAME                                            STATUS   ROLES    AGE    VERSION
<host1>.<region>.compute.internal   Ready    <none>   6d6h   v1.18.9-eks-d1db3c
<host2>.<region>.compute.internal   Ready    <none>   6d6h   v1.18.9-eks-d1db3c
<host3>.<region>.compute.internal   Ready    <none>   6d5h   v1.18.9-eks-d1db3c
<host4>.<region>.compute.internal   Ready    <none>   6d5h   v1.18.9-eks-d1db3c

Ami: ami-0a3d7ac8c4302b317 Pods:

aws-node-rktgg                        1/1     Running   1          3m45s
aws-node-c89ph                        1/1     Running   1          3m20s
kube-proxy-x8t7m                      1/1     Running   0          3m20s
kube-proxy-bfd7x                      1/1     Running   0          3m45s

kube-proxy-bfd7x log:

portRange: ""
udpIdleTimeout: 250ms: v1alpha1.KubeProxyConfiguration.Conntrack: v1alpha1.KubeProxyConntrackConfiguration.ReadObject: found unknown field: max, error found in #10 byte of ...|ck":{"max":0,"maxPer|..., bigger context ...|":"","configSyncPeriod":"15m0s","conntrack":{"max":0,"maxPerCore":32768,"min":131072,"tcpCloseWaitTi|...
I0209 13:41:30.395728       1 feature_gate.go:243] feature gates: &{map[]}
I0209 13:41:30.395787       1 feature_gate.go:243] feature gates: &{map[]}
E0209 13:41:30.988955       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:31.999074       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:34.259534       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:38.355840       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:41:47.022235       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
E0209 13:42:05.550145       1 node.go:125] Failed to retrieve node info: nodes "<host>.<eks_name>.compute.internal" not found
I0209 13:42:05.550167       1 server_others.go:178] can't determine this node's IP, assuming 127.0.0.1; if this is incorrect, please set the --bind-address flag

aws-node-rktgg log:

{"level":"info","ts":"2021-02-09T13:41:33.346Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-02-09T13:41:33.356Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-02-09T13:41:33.357Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E0209 13:42:03.383486       9 memcache.go:138] couldn't get current server API group list; will keep using cached value. (Get https://xxx.xx.x.x:443/api?timeout=32s: dial tcp xxx.xx.x.x:443: i/o timeout)

One thing that catches my eye is that kube-proxy is looking for <host>.<eks_name>.compute.internal while the actual hostname is <host1>.<region>.compute.internal.

I found the root cause of this issue (at least for my use case) - my own fault :) I had set the DHCP option set incorrectly to <eks_name>.compute.internal. After changing it to <region>.compute.internal, nodes come up correctly and quickly.

@yongzhang

yongzhang commented Mar 4, 2022

For the hostname issue, I modified self-managed workers' launch template userdata with a dirty hack:

# Adjust according to your exact region
hostname "$(hostname).ap-northeast-1.compute.internal"

then

# Omitted other modifications...
/etc/eks/bootstrap.sh <cluster name>

kube-proxy stopped complaining about Failed to retrieve node info: nodes "ip-10-1-2-3" not found, and the CNI no longer crashes on first start due to the probe failure with the error the OP reported.

Before this change, a worker node waiting for the CNI to become ready would be stuck in the NotReady state for about 2 minutes. Adding this reduces that to less than 20 seconds.

If you are using the managed kube-proxy add-on and don't want to change the deployment on existing clusters (since there will likely be downtime), this seems feasible.

Does $(hostname) work here? cloud-init is reporting syntax errors:

2022-03-04 07:37:45,306 - util.py[DEBUG]: Failed to non-persistently adjust the system hostname to $(hostname)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/distros/__init__.py", line 274, in _apply_hostname
    subp.subp(['hostname', hostname])
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 293, in subp
    raise ProcessExecutionError(stdout=out, stderr=err,
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['hostname', '$(hostname)']
Exit code: 1
Reason: -
Stdout:
Stderr: hostname: the specified hostname is invalid
ubuntu@ip-10-120-24-150:~$ cloud-init -v
/usr/bin/cloud-init 21.4-0ubuntu1~20.04.1

Edited:
Found this: shell=False, so neither $(hostname) nor $HOSTNAME will work

2022-03-04 08:03:35,719 - subp.py[DEBUG]: Running command ['hostname', '$HOSTNAME'] with allowed return codes [0] (shell=False, capture=True)

Had to add it to runcmd and it works:

runcmd:
  - 'hostnamectl set-hostname $(hostname).<my-aws-region>.compute.internal'

And I can confirm this change fixed the kube-proxy issue; aws-node starts very quickly as well.

@github-actions

github-actions bot commented May 4, 2022

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale label May 4, 2022
@druchoo

druchoo commented May 4, 2022

/remove stale

@github-actions github-actions bot removed the stale label May 5, 2022
@github-actions

github-actions bot commented Jul 4, 2022

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added and removed the stale label Jul 4, 2022
@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale label Sep 21, 2022
@github-actions

github-actions bot commented Oct 6, 2022

Issue closed due to inactivity.

@github-actions github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 6, 2022