EKS 1.16 / v1.6.x: "couldn't get current server API group list; will keep using cached value" #1078
There are quite a few similar issues, but they aren't exactly the same error. Not sure if this is a duplicate issue or not... #1049 (comment) (logs don't match)
Same for
@max-rocket-internet Thanks for testing with v1.6.3 as well. I suspect the issue here is that kube-proxy has not yet set up the iptables rules needed to resolve the cluster IP of the API server. I guess a work-around to avoid the restarts would be to retry a few times before returning an error, instead of letting kubelet do the retries. That would at least hide these errors from the logs.
Makes sense, but why wasn't this a problem with earlier versions?
We had the same issue after upgrading from EKS 1.15 to 1.16. We were just bumping the image version inside the DaemonSet to 1.6.x. What solved our issue was applying the full YAML provided by the AWS docs: it made changes to both the DaemonSet and the ClusterRole. Good luck!
But there are no changes to that file between the v1.6.2 and v1.6.3 releases.
Any update @mogren?
@max-rocket-internet Hey, sorry for the lack of updates on this. Been out for a bit without much network access, so I haven't been able to track this one down. I agree that there is no config change between v1.6.2 and v1.6.3, but since v1.5.x we have updated the readiness and liveness probe configs. Between Kubernetes 1.15 and 1.16 kube-proxy has changed, so that could be related. We have not yet been able to reproduce this when doing master upgrades.
We had the same problem when updating the master and nodegroups from 1.15 to 1.16. We had to pin the version of kube-proxy.
^^
Just an FYI, we were encountering this issue on k8s 1.17, kube-proxy 1.16.13, AWS CNI 1.6.3 and 1.7.1. Turns out the issue was a bad PSP for
So two suggestions now:
Any confirmation from AWS about the issue and resolution?
@max-rocket-internet @Niksko Could you please provide the kube-proxy configs that you have where this issue shows up? Is this on EKS clusters? What Kubernetes version? Do you have some custom PSP for these clusters? Are the worker nodes using a custom AMI, or do they have some script locking iptables on startup? Is there something you know of that changed in 1.16.13 that makes starting kube-proxy take slightly longer, triggering this issue?
@mogren this is on EKS, version 1.17. We discovered this as part of adding custom PSPs to all components. No scripts locking iptables on startup, using the standard EKS AMIs. The behaviour we were seeing was that the aws-node pod never became ready and was crash-looping. Apologies if that caused any confusion. I think it's not unreasonable to conclude that:
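For anyone hitting the PSP variant of this: system components like aws-node and kube-proxy need a broadly permissive policy along the lines of the default eks.privileged one. A rough sketch of such a policy (illustrative only, not the exact AWS-shipped manifest, and the name is hypothetical):

```yaml
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: privileged-system   # hypothetical name
spec:
  privileged: true
  hostNetwork: true          # aws-node and kube-proxy run on the host network
  hostPID: true
  hostIPC: true
  hostPorts:
    - min: 0
      max: 65535
  allowedCapabilities: ["*"]
  volumes: ["*"]
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
```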
I am facing a similar issue: the aws-node pod restarts once on every node startup, and works fine after that.
EKS version: 1.17. I tried adding a sleep in aws-node to rule out that this is happening because kube-proxy is taking time to start, and verified that kube-proxy started before aws-node.
Hi @tibin-mfl, can you please share the CNI logs? Thanks.
Hi, just wanted to chime in that we're seeing the same thing. Like others have mentioned, the pod seems to restart once when the node first starts up and it's fine after that. We're not using any custom PSPs. EKS version: 1.17. I can see these errors in
And this is in the
It seems like this has started happening for us as part of the 1.17 upgrade; we haven't restarted all our nodes since the upgrade, and I can see that the pods that are still running (on AMI
I'm happy to share the full logs if they're helpful; just give me an email address to send them!
Hi @mogggggg Thanks for letting us know. Please kindly share the full logs from the log collector script. Thanks.
Hi @mogggggg Thanks for sharing the logs. Will review and get back on that ASAP.
It's default EKS
Yes
Nope
AMI is
@jayanthvn Just wanted to know: does the aws-node pod depend on kube-proxy, or vice versa?
@tibin-mfl Yes, the CNI pod (aws-node) needs kube-proxy to set up the cluster IPs before it can start up.
Hi @mogggggg Sorry for the delayed response. As you have mentioned, it looks like kube-proxy is waiting to retrieve node info, and during that time frame aws-node starts and is unable to communicate with the API server because iptables isn't updated yet, hence it restarts. I will try to repro and we will see how to mitigate this issue. Thanks for your patience.
Currently, the workaround is adding a busybox init container that waits for kube-proxy to start.
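Something along these lines should do it (a sketch only: the container name, image tag, and timings are placeholders; it polls the in-cluster API service address, which only becomes reachable once kube-proxy has written its iptables rules):

```yaml
initContainers:
  - name: wait-for-kube-proxy   # hypothetical name, not from any official manifest
    image: busybox:1.33
    command:
      - sh
      - -c
      - |
        # The kubernetes service cluster IP is only reachable after kube-proxy
        # installs its iptables rules, so poll it before aws-node starts.
        until nc -z -w 2 "$KUBERNETES_SERVICE_HOST" "$KUBERNETES_SERVICE_PORT"; do
          echo "Waiting for kube-proxy to set up the kubernetes service..."
          sleep 1
        done
```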
I added the initContainer to my
@tibin-mfl Thanks for reporting this, that is definitely concerning. Do you have kube-proxy logs from any of these nodes? It would be very interesting to see why kube-proxy was taking that long to start up!
Any news on that internal ticket? We keep running into this issue whenever our nodes start.
Hi, sorry for the delayed response. Current findings: NodeIP was added upstream to determine the IP address family (kubernetes/kubernetes#91725). The hostname is picked here - https://github.com/kubernetes/kubernetes/blob/c88d9bed17bba40da02772a8dccb107f5222efc4/cmd/kube-proxy/app/server_others.go#L129 - which is obtained from
One option is to set "--bindAddress" to 127.0.0.1; this determines the address family as v4 and kube-proxy won't wait to get the node IP, but it will break for v6. We are exploring other options and will update more soon.
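For anyone who wants to experiment with that idea, it amounts to something like the following in the kube-proxy configuration (a sketch only; on EKS this lives in a ConfigMap in kube-system whose exact name varies by version, and as noted above it is not suitable for IPv6 clusters):

```yaml
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Binding to loopback resolves the address family as IPv4 immediately,
# so kube-proxy does not wait on the node IP lookup (breaks IPv6).
bindAddress: 127.0.0.1
```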
@jayanthvn Thanks for working on this. Let me share another case. The version is almost the same as in #1078 (comment).
Another tidbit: we ran into this very issue when upgrading from v1.15 to v1.16. Our current workaround: we're keeping kube-proxy at v1.15.11 even after upgrading the rest of the cluster to v1.16. The rest of the add-ons we were able to get to the recommended versions.
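In practice, pinning kube-proxy just means leaving the image tag on the kube-proxy DaemonSet at the older version, roughly like this excerpt (a sketch; the ECR account and region in the image URL differ per EKS region):

```yaml
# Excerpt of the kube-proxy DaemonSet pod template - only the image tag is pinned.
containers:
  - name: kube-proxy
    image: 602401143452.dkr.ecr.us-west-2.amazonaws.com/eks/kube-proxy:v1.15.11
```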
I think there is a workaround for this issue. At least for me, after applying this I see no more
I've recently upgraded from 1.19 to 1.20 and am seeing these now. However, the aws-node pods are never able to connect.
I guess this wouldn't work with EKS's managed addon? We ran into an issue where our coredns configmap was being overwritten constantly. Asked support, and they said the managed addon will reconcile every 15 minutes. Since kube-proxy is another managed addon, it looks like we will have to wait until aws/containers-roadmap#1415 is done.
For the hostname issue, I modified self-managed workers' launch template userdata with a dirty hack:

```sh
# Adjust according to your exact region
hostname "$(hostname).ap-northeast-1.compute.internal"
```

then

```sh
# Omitted other modifications...
/etc/eks/bootstrap.sh <cluster name>
```

kube-proxy stopped complaining about failing to retrieve the node info. Before this change, a worker node waiting for CNI to be ready would be stuck in NotReady state for about 2 minutes; adding this reduced the time to less than 20 seconds. If you are using the managed kube-proxy addon and don't want to change the deployment for existing clusters (since there will likely be downtime), this seems feasible.
We are sometimes running into a race condition where aws-node is started before kube-proxy. Without kube-proxy, kubernetes.default.svc.cluster.local is not available, aws-node will fail to start, and the container is not automatically restarted. To mitigate this, we added the following initContainer to aws-node:

```yaml
initContainers:
  - name: wait-for-kubernetes-api
    image: curlimages/curl:7.77.0
    command:
      - sh
      - -c
      - |
        while ! timeout 2s curl --fail --silent --cacert /run/secrets/kubernetes.io/serviceaccount/ca.crt "https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/healthz"; do
          echo Waiting on Kubernetes API to respond...
          sleep 1
        done
    securityContext:
      runAsUser: 100
      runAsGroup: 101
```
I found the root cause of this issue (at least for my use case): my own fault :) I had set the DHCP option set incorrectly to
Does
Edited:
Had to add it to
And I can confirm this change fixed the kube-proxy issue; aws-node starts very fast as well.
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.
/remove stale
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.
Issue closed due to inactivity.
We see the aws-node pods crash on startup sometimes with the "couldn't get current server API group list; will keep using cached value" error logged. After starting and crashing, the pod is then restarted and runs fine. About half of the aws-node pods do this.