Failing liveness Probe #1038
Hi @ltagliamonte-dd, there is definitely a chance that you are getting hit by issue #1008. I did a fix for this instance metadata issue in #1011, but right now we only have a pre-release build. If you could run the
Since the cluster is fairly large, there is also a risk of getting throttled by the EC2 API. If that's the case, you should see something like this in the logs:
If the
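(Not from the thread, but for readers following along: a quick way to confirm which CNI build the nodes are actually running is to check the image tag on the aws-node DaemonSet.)

```sh
# Check which amazon-k8s-cni image (and therefore CNI version) is deployed.
kubectl -n kube-system describe daemonset aws-node | grep -i image
```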
@ltagliamonte-dd @mogren While we investigate, I'm just wondering: to stop the aws-node pod from recycling, can we try increasing the timeout period from 1s to 5s?
Hmm, I do think ipamd should always reply within 1 second, but we could increase the
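For anyone wanting to try the 5s suggestion above, a minimal sketch of bumping the exec probe timeouts on the aws-node DaemonSet (assuming aws-node is the first container in the pod template; newer manifests may also pass timeout flags to the probe command itself, which would need changing separately):

```sh
# Sketch: raise the liveness/readiness probe timeouts on aws-node to 5s.
# Assumes aws-node is the first (index 0) container in the pod template.
kubectl -n kube-system patch daemonset aws-node --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 5},
  {"op": "add", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 5}
]'
```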
@mogren I've continued the investigation on my end and I noticed that my server was having a lot of ARP cache overflows. It's been 2 days that I've been running with the following (I have a VPC with two /16 CIDRs):
and I've got no sign of failing liveness probes.
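For readers hitting the same ARP cache overflow: the usual mitigation on Linux is to raise the kernel neighbor table thresholds. The exact values used above are not shown in the thread, so the ones below are purely illustrative:

```sh
# Illustrative values only; tune for your node size and pod density.
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=32768
sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=65536
# Persist across reboots, e.g. via a file under /etc/sysctl.d/
```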
@ltagliamonte-dd Great, thanks a lot for the update!
After a week of monitoring I'm not recording any restarts or liveness probe failures anymore, so I'm considering this issue closed.
@ltagliamonte-dd Thanks a lot for the update, and feel free to ping me any time if you find an issue 😄
Hi, I have the same problem today. The cause for me, as I can see in
So, I have enough free addresses in the subnet. Perhaps this can help other people who have the same problem.
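(A quick, generic way to verify the free addresses per subnet, in case it helps others; the subnet IDs below are placeholders:)

```sh
# List remaining free IPs in the cluster subnets (replace the subnet IDs).
aws ec2 describe-subnets --subnet-ids subnet-aaaa1111 subnet-bbbb2222 \
  --query 'Subnets[*].[SubnetId,AvailableIpAddressCount]' --output table
```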
Hi @hervems, which CNI version is this with, and do you have per-pod security groups enabled? Thanks!
and
If you can email me (varavaj@amazon.com) the logs from this script - thanks!
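(The script referenced above isn't shown; for most CNI support cases the log collection script shipped with the VPC CNI is used, so assuming that one:)

```sh
# Run the CNI log collection script on the affected node (path as shipped
# with the VPC CNI; adjust if your AMI places it elsewhere).
sudo bash /opt/cni/bin/aws-cni-support.sh
```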
Hi team, I am facing the same issue: liveness and readiness probe failures on aws-node, causing it to crash-loop and the pods/services hosted on the node to become unavailable.

Events:
Normal  Scheduled  39m  default-scheduler  Successfully assigned kube-system/aws-node-qct2m to xxxxxxxxxxxxxx

EKS version: 1.19
Node version: v1.19.6-eks-49a6c0
Instance type: t3a.large
As a workaround I tried to increase the connect timeout for the liveness and readiness probe command exec, but it still doesn't work. The issue is not related to #1360, as I checked the loopback interface on a node with a failing aws-node:
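(For reference, the loopback check mentioned above can be done like this on the node:)

```sh
# The lo interface should be UP with 127.0.0.1/8 assigned (see #1360).
ip link show lo
ip addr show lo
```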
Kindly let me know if you need more information.
This is what the ipamd log states (FYI: the nodes are self-managed nodes):
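(For anyone looking for the same logs: ipamd and the CNI plugin write to /var/log/aws-routed-eni on the node.)

```sh
# Tail the ipamd and CNI plugin logs on the affected node.
sudo tail -n 100 /var/log/aws-routed-eni/ipamd.log
sudo tail -n 100 /var/log/aws-routed-eni/plugin.log
```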
/reopen
Having the same issue with
@alaa - Which k8s version are you on? There is a known issue with K8s 1.21 - ref: #1425 (comment)
@jayanthvn Oh, I am using EKS 1.21
@jayanthvn I am using EKS 1.21 and tried both v1.9.3 and v1.10.1 of the CNI and see the same issue as well.
Any update on this issue? I am also facing a similar issue.
I am deploying the EKS add-on using a CloudFormation script - below is the snippet:
EksClusterAddon:
I am not providing any addon version, so by default the above version gets picked up by AWS.
Logs for the aws-node pod:
{"level":"info","ts":"2021-12-08T13:01:00.915Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
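(To see which version the add-on resolved to when no AddonVersion is pinned, something like the following works; the cluster name and Kubernetes version are placeholders:)

```sh
# Which vpc-cni add-on version did the cluster pick up?
aws eks describe-addon --cluster-name my-cluster --addon-name vpc-cni \
  --query 'addon.addonVersion' --output text

# Which versions are available for this Kubernetes version?
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.21 \
  --query 'addons[0].addonVersions[*].addonVersion' --output text
```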
Same issue, posted a comment here: #1425 (comment)
@crechandan, I am on the same setup facing the same issues - did you manage to find a resolution for that setup?
@s7an-it No, as of now I am unable to find any resolution for this (using AWS CloudFormation for the EKS add-on). As a workaround, I am installing the AWS CNI with a simple kubectl command, calling it from an Ansible playbook (running on an EC2 instance that has access to the EKS cluster). Follow this page: https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on This is the reason why I wanted to manage the cluster add-on via CloudFormation (to get rid of this workaround).
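(A rough sketch of that workaround; the manifest file and version must come from the linked documentation page, so the filename below is just a placeholder:)

```sh
# Apply the CNI manifest obtained from the EKS docs, then wait for rollout.
kubectl apply -f aws-k8s-cni.yaml
kubectl -n kube-system rollout status daemonset aws-node
```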
I am able to reproduce a similar issue.
Debugging from the node:
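(The exact commands used aren't shown; typical node-side checks against ipamd's local endpoints would look something like this, assuming the default ports:)

```sh
# ipamd introspection endpoint (ENI/IP state) and Prometheus metrics.
curl -s http://localhost:61679/v1/enis | head
curl -s http://localhost:61678/metrics | head
```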
Solution:
Delete the pods so that their DaemonSet recreates them.
Confirm they are running (see the sketch below).
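A sketch of those two steps, assuming the pods carry the standard k8s-app=aws-node label:

```sh
# Delete the failing aws-node pods; the DaemonSet controller recreates them.
kubectl -n kube-system delete pod -l k8s-app=aws-node

# Confirm the recreated pods are Running and Ready.
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
```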
If you are using an STS VPC endpoint, check the variables AWS_STS_REGIONAL_ENDPOINTS, AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN.
From: https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html Also check the subnets of the VPC endpoint.
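(One way to check those variables on a running aws-node pod; with IRSA, AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are injected at the pod level by the webhook, so inspect a pod rather than the DaemonSet spec:)

```sh
# Print the environment configured on the first aws-node pod's main container.
POD=$(kubectl -n kube-system get pods -l k8s-app=aws-node -o name | head -n 1)
kubectl -n kube-system get "$POD" \
  -o jsonpath='{range .spec.containers[0].env[*]}{.name}={.value}{"\n"}{end}'
```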
We are recording a high number of AWS CNI container restarts.

Searching our Splunk logs didn't lead us to any particular error.
Looking at the Kubernetes events for the kube-system namespace, we can see that the pods are failing the liveness probe:
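(For reference, the probe-failure events can be pulled like this:)

```sh
# kubelet reports failed liveness/readiness probes with reason=Unhealthy.
kubectl -n kube-system get events --field-selector reason=Unhealthy \
  --sort-by=.lastTimestamp | grep aws-node
```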
Our settings look like this:
The liveness probe seems to behave correctly when invoked from the container:
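(The probe command itself isn't shown above; recent aws-node manifests use grpc-health-probe against ipamd's gRPC port, so a manual invocation would look roughly like this - treat the binary path and port as assumptions:)

```sh
# Manually run the health check inside an aws-node container.
POD=$(kubectl -n kube-system get pods -l k8s-app=aws-node -o name | head -n 1)
kubectl -n kube-system exec "$POD" -- /app/grpc-health-probe -addr=:50051
```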
Would really appreciate any help in continuing this investigation.
Shall we just relax the timeout specs? (Would love to hear from other folks.)