CNI fails healthchecks, errors when we run containers #1590
Hi @korotovsky. Regarding [1], the primary ENI log you are seeing comes from the reconciler, which runs every 5 seconds - amazon-vpc-cni-k8s/pkg/ipamd/ipamd.go, Line 548 in be5d0b6.
Regarding [2], I see you are using k8s 1.21. Can you please check whether you are hitting this issue: #1425 (comment)? That would require the timeoutSeconds of the aws-node livenessProbe to be set to 5s.
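A minimal sketch of the suggested probe change, assuming the stock aws-k8s-cni DaemonSet layout where aws-node is the first container (the default timeoutSeconds is 1; pick a value that suits your cluster):

```bash
# Sketch only: raise the aws-node livenessProbe timeout to 5s, per the suggestion in #1425.
# Assumes aws-node is containers[0] in the kube-system/aws-node DaemonSet.
kubectl -n kube-system patch daemonset aws-node --type=json -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5}
]'
```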
@jacksontj Thanks for the hint, I'll try it. I also noticed the following in the logs:
This goes in combination with the errors about network deletion etc. related to the cron container, and I'm wondering: what if raising the probe timeout just "masks" the real issue? Maybe the container that the CNI expects to see in its cache should not be deleted so early? In other words, what if keeping the finished CronJob and its container around a bit longer would help? I did not try it, I'm just curious to fact-check this, and maybe it would be helpful for others in similar scenarios too. UPDATE: I'm pretty sure there is a high chance of reproducing a similar error if you create ~4-5 CronJobs that run every minute and finish quickly enough to be ready for the next run.
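If anyone wants to experiment with keeping finished cron pods around longer, a rough sketch (the CronJob name "my-cron" is a placeholder, and this is an untested idea rather than a known fix):

```bash
# Placeholder name "my-cron": keep more finished Jobs (and their pods) around so
# sandbox teardown is less aggressive right after each run.
kubectl patch cronjob my-cron --type=merge -p '{
  "spec": {"successfulJobsHistoryLimit": 5, "failedJobsHistoryLimit": 5}
}'
# Note: if spec.jobTemplate.spec.ttlSecondsAfterFinished is set on the Job template,
# it controls how soon finished Jobs are deleted regardless of the history limits.
```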
Because of the timeout, aws-node will restart. The CNI is unable to process kubelet requests because it cannot connect to IPAMD and errors out; kubelet will retry until the operation succeeds. If the issue happened without aws-node restarting/timing out, then it might be an IPAMD issue, but here it looks expected. Is it possible to increase the timeout and verify, or, if possible, to try similar CronJobs on an older k8s cluster (1.16 or 1.17) with CNI 1.9?
Until the CNI sends a del call to IPAMD, the local cache of IPAMD won't be cleaned up.
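A rough way to verify the "CNI cannot reach IPAMD" part from an affected node, assuming the default amazon-vpc-cni-k8s probe command and introspection port (adjust if your install differs):

```bash
# Same gRPC health check the aws-node probes use (path/port are the upstream defaults).
# This picks one pod of the DaemonSet; exec into the specific pod on the affected node if needed.
kubectl -n kube-system exec ds/aws-node -c aws-node -- /app/grpc-health-probe -addr=:50051

# IPAMD introspection endpoint on the node itself, listing the ENIs/IPs it tracks
# (enabled by default on localhost:61679).
curl -s http://localhost:61679/v1/enis | head
```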
@jayanthvn I see, okay. I increased the timeouts for the moment and will observe with the CronJobs enabled. Unfortunately I'm now a bit limited in time because I have already spent so much on investigating this issue, so I could not try it with older versions, I'm sorry! I'll keep this issue updated with the results after applying the increased timeouts, thanks for your help.
@korotovsky no problem, I understand. Thanks so much for the deep dive :)
@jayanthvn Honestly, yesterday I could not manage it and rolled back to EKS 1.19 + CNI 1.9, but the same errors appeared. Then I installed CNI 1.8 and so far it runs stably on EKS 1.19. Maybe later I'll try upgrading to EKS 1.21 with CNI 1.8, but it seems that with my workload CNI 1.9 is a bit buggy even with increased timeouts.
@korotovsky Thanks for trying. I will also try to repro (~4-5 CronJobs with an interval of every minute). Next time you hit the issue, could you please share the logs from one of the affected nodes? To collect the logs, you can run this script on the node -
@jayanthvn I tried this script when I came to create this issue and checked what was gathered, and then realized that it collected a bit too much sensitive data, e.g. commands for sidecars, container names from our private ECR, etc., which we don't really want to share publicly. Is there any way to share these logs privately? Meanwhile, I also found slight confirmation of my theory about CronJobs:
When I said yesterday that my workload worked "stable", it was one deployed app with one CronJob every minute. Today I stabilized other things and deployed the remaining 5, so overall I got a workload of 6 pretty similar apps, each with one CronJob, and I started to receive "context deadline exceeded" errors on new deployments (however, aws-cni did not report any health-check errors); some I could not even deploy completely. My plan is the following:
So far I'm pretty sure the issues are coming from excessive CronJobs running at a short interval. Here is my "simplified" CronJob, feel free to try it out with your own container:
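(The original manifest did not survive into this thread; the sketch below is only a guess at its shape: a CronJob that fires every minute and exits quickly, with placeholder names and a generic image.)

```bash
# Guessed reproducer: fast, minute-interval CronJob so completed pods churn quickly.
# On clusters older than 1.21, use apiVersion: batch/v1beta1 instead of batch/v1.
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cni-repro-cron
spec:
  schedule: "*/1 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: worker
            image: busybox:1.35
            command: ["sh", "-c", "echo done && sleep 5"]
EOF
```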
@korotovsky Sorry, I missed sharing my email ID; you can send the logs to varavaj@amazon.com. This is great, thanks for all the info. I will try this CronJob locally.
@korotovsky - sorry for the delay. I haven't received the logs. Can you please run this script -
I am trying to reproduce a bug where pods end up in the init state because aws-node gets into a loop due to "caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}.
Update: reboots and a restart of kubelet didn't help.
Update 3: terminating the node fixed the issue; after the terminate I got 1 primary + an extra ENI. I only terminated a single node, but that fixed all 3 failing ones. Does that mean they were in some kind of shared group or something? Aside from that, this looks like an issue with releasing IPs, or something related to ENI release and reconcile; probably if I release the IPs and let it refill them, I might not even need to terminate? Ultimately, the positive thing is that I can schedule an extra node, move all the traffic pods that already work with no downtime, and only terminate the one that is already stuck in the init loop.
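A possible lighter-weight recovery than terminating the instance, sketched under the assumption that restarting aws-node makes IPAMD rebuild its ENI/IP state (untested for this exact failure; <node-name> is a placeholder):

```bash
# Move workloads off the node first (older kubectl uses --delete-local-data instead).
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Restart the aws-node pod on that node so IPAMD starts with a fresh cache.
kubectl -n kube-system delete pod -l k8s-app=aws-node \
  --field-selector spec.nodeName=<node-name>
# Allow scheduling again once aws-node is Ready.
kubectl uncordon <node-name>
```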
This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days |
Issue closed due to inactivity. |
Hello, we are experiencing errors in the AWS CNI. In particular, from the logs I can point out two possible major issues:
At some point it ended up with this:
What you expected to happen:
Passing probes, normal operation.
How to reproduce it (as minimally and precisely as possible):
I'm afraid I could not provide any reproduction steps.
Anything else we need to know?:
We are running CronJobs every minute, and I have an idea that this could be the issue. Maybe we should adjust some CNI settings for this kind of utilization?
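If tuning turns out to be the route, the knobs that usually matter for fast pod churn are the warm pool settings on the aws-node DaemonSet; the values below are purely illustrative, not a recommendation:

```bash
# Illustrative only: keep a larger warm pool of IPs so rapid CronJob churn
# does not have to wait on ENI/IP allocation.
kubectl -n kube-system set env daemonset/aws-node \
  WARM_IP_TARGET=5 \
  MINIMUM_IP_TARGET=10
```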
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. `cat /etc/os-release`):
- Kernel (e.g. `uname -a`):