
Failing liveness Probe #1038

Closed
ltagliamonte-dd opened this issue Jun 16, 2020 · 25 comments

@ltagliamonte-dd

We are recording a high number of aws-cni container restarts.
[screenshot of aws-node container restarts]
Searching our Splunk logs didn't lead us to any particular error.
Looking at the Kubernetes events for the kube-system namespace, we can see that the pods are failing their liveness and readiness probes:

107s        Warning   Unhealthy       pod/aws-node-2p68j   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.812Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
107s        Warning   Unhealthy       pod/aws-node-2p68j   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:58:23.867Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m         Warning   Unhealthy       pod/aws-node-4zjf7   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:41:27.975Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
18m         Warning   Unhealthy       pod/aws-node-4zjf7   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:41:30.484Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m         Warning   Unhealthy       pod/aws-node-7c6zn   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:34:21.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
25m         Warning   Unhealthy       pod/aws-node-7c6zn   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:34:23.817Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m         Warning   Unhealthy       pod/aws-node-7c6zn   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:35:11.820Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
24m         Warning   Unhealthy       pod/aws-node-7c6zn   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:35:13.828Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m         Warning   Unhealthy       pod/aws-node-7c6zn   Liveness probe failed: {"level":"info","ts":"2020-06-16T15:36:11.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
23m         Warning   Unhealthy       pod/aws-node-7c6zn   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:36:13.821Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}
32s         Warning   Unhealthy       pod/aws-node-b6szl   Readiness probe failed: {"level":"info","ts":"2020-06-16T15:59:38.295Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service \":50051\" within 1s"}

Our setup looks like:

  • Kubernetes 1.16.9 provisioned via kops (~850 nodes)
  • AWS CNI 1.6.1 runs without resource limits
  • liveness and readiness probes:
      Requests:
        cpu:      10m
      Liveness:   exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3
      Readiness:  exec [/app/grpc-health-probe -addr=:50051] delay=35s timeout=1s period=10s #success=1 #failure=3

The liveness probe seems to behave correctly when invoked manually from inside the container:

/app/grpc-health-probe -addr=:50051
{"level":"info","ts":"2020-06-16T16:12:25.209Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"status: SERVING"}

Would really appreciate any help in continuing this investigation.
Should we just relax the probe timeouts? (Would love to hear from other folks.)

@mogren
Contributor

mogren commented Jun 16, 2020

Hi @ltagliamonte-dd,

There is definitely a chance that you are being hit by issue #1008. I made a fix for the instance metadata handling in #1011, but right now we only have a pre-release build, v1.6.3-rc1, with that fix included. We are targeting a final release before the end of the week.

If you could run the aws-cni-support.sh script on a node with issues, that would help confirm the problem.

@mogren
Contributor

mogren commented Jun 16, 2020

Since the cluster is fairly large, there is also a risk of getting throttled by the EC2 API. If that's the case, you should see something like this in the logs:

[ERROR] Failed to get ENI eni-0eXXXXXXXXXXXXXX information from EC2 control plane RequestLimitExceeded: Request limit exceeded.

If the aws-node pods come up after that initial restart, the node should become ready.

@SaranBalaji90
Contributor

SaranBalaji90 commented Jun 19, 2020

@ltagliamonte-dd @mogren while we investigate, I'm just wondering: to stop the aws-node pods from recycling, can we try increasing the probe timeout from 1s to 5s?

https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#configure-probes

@mogren
Contributor

mogren commented Jun 19, 2020

Hmm, I do think ipamd should always reply within 1 second, but we could increase the failureThreshold from the default of 3 to something like 5 or 10 times, to give it a few more seconds before it gets restarted.
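
For reference, a minimal sketch of how both of those knobs could be adjusted on the aws-node DaemonSet with a JSON patch; the 5s timeout, the failureThreshold of 5, and the container index 0 are illustrative assumptions, not an official recommendation:

# Sketch only: relax the exec-probe timeout and failure threshold on aws-node.
# Assumes aws-node is the first container in the pod template (index 0).
kubectl -n kube-system patch daemonset aws-node --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds",   "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold",  "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds",   "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 5}
]'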

@ltagliamonte-dd
Author

@mogren I've continued the investigation on my end and noticed that my servers were hitting a lot of ARP cache overflows. For the past two days I've been running with (I have a VPC with two /16 CIDRs):

net.ipv4.neigh.default.gc_thresh2 = 65534 (50% of the two /16s)
net.ipv4.neigh.default.gc_thresh3 = 104854 (80% of the two /16s)

and I've seen no sign of failing liveness probes.
I'll keep monitoring until the end of next week and report back here.
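
For anyone who wants to try the same thing, a minimal sketch of applying and persisting those thresholds on a node, using the exact values quoted above (gc_thresh1 left at its default):

# Sketch only: raise the kernel neighbour (ARP) table limits and persist them across reboots.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-neigh-gc.conf
net.ipv4.neigh.default.gc_thresh2 = 65534
net.ipv4.neigh.default.gc_thresh3 = 104854
EOF
sudo sysctl --system   # re-read all sysctl config files so the new limits take effect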

@mogren
Contributor

mogren commented Jun 19, 2020

@ltagliamonte-dd Great, thanks a lot for the update!

@ltagliamonte-dd
Author

ltagliamonte-dd commented Jun 27, 2020

After a week of monitoring I'm no longer recording any restarts or liveness probe failures, so I'm considering this issue closed.
Thank you @mogren for the support.

@mogren
Contributor

mogren commented Jun 27, 2020

@ltagliamonte-dd Thanks a lot for the update, and feel free to ping me any time if you find some issue 😄

@hervems

hervems commented Apr 13, 2021

Hi, I hit the same problem today. In my case the cause can be seen in /var/log/aws-routed-eni/ipamd.log on the node hosting the affected pod:

{"level":"error","ts":"2021-04-13T16:34:05.272Z","caller":"aws-k8s-agent/main.go:28","msg":"Initialization failure: 
failed to allocate one IP addresses on ENI eni-04263466a7d86c036, err: allocate IP address:
failed to allocate a private IP address: 
InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.\n\tstatus code: 400, request id: f32133ee-7bc9-4259-b1fa-5a21b39e7165: allocate IP address:
failed to allocate a private IP address: 
InsufficientFreeAddressesInSubnet: The specified subnet does not have enough free addresses to satisfy the request.\n\tstatus code: 400, request id: f32133ee-7bc9-4259-b1fa-5a21b39e7165"}

So in my case the subnet did not have enough free addresses. Perhaps this can help other people who have the same problem.
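
As an aside (not part of the original comment), a quick way to check how many free IPs a subnet has left, with subnet-0123456789abcdef0 as a placeholder ID:

# Sketch only: report the remaining free IP addresses in the subnet the ENIs allocate from.
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0 \
  --query 'Subnets[].{ID:SubnetId,FreeIPs:AvailableIpAddressCount}' --output table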

@jayanthvn
Contributor

Hi @hervems

Which CNI version is this with, and do you have per-pod security groups enabled?

Thanks!

@hervems

hervems commented Apr 13, 2021

kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2 :

amazon-k8s-cni-init:v1.7.5-eksbuild.1
amazon-k8s-cni:v1.7.5-eksbuild.1

and

kubectl describe daemonset aws-node -n kube-system | grep -i "ENABLE_POD_ENI"

ENABLE_POD_ENI: false

@jayanthvn
Contributor

If you can email me (varavaj@amazon.com) the logs from this script (sudo bash /opt/cni/bin/aws-cni-support.sh), I can verify why the IPs are not released. Also, can you please share the number of pods running on this instance?

Thanks!

@vinothav

Hi Team,

I am facing the same issue: liveness and readiness probe failures on aws-node cause the pods to crash-loop, and the pods/services hosted on the node become unavailable.

Events:

Type      Reason      Age    From                Message
Normal Scheduled 39m default-scheduler Successfully assigned kube-system/aws-node-qct2m to xxxxxxxxxxxxxx
Normal Pulling 37m kubelet Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10"
Normal Pulled 37m kubelet Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni-init:v1.7.10" in 690.175934ms
Normal Created 37m kubelet Created container aws-vpc-cni-init
Normal Started 36m kubelet Started container aws-vpc-cni-init
Normal Pulled 36m kubelet Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 300.589066ms
Warning Unhealthy 35m kubelet Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:01.143Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}
Warning Unhealthy 35m kubelet Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:31.943Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}
Warning Unhealthy 35m kubelet Liveness probe failed: {"level":"info","ts":"2021-04-20T14:53:47.560Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}
Warning Unhealthy 34m kubelet Readiness probe failed: {"level":"info","ts":"2021-04-20T14:53:57.051Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}
Warning Unhealthy 34m kubelet Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:16.743Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}
Warning Unhealthy 34m kubelet Liveness probe failed: {"level":"info","ts":"2021-04-20T14:54:42.655Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}
Normal Killing 34m kubelet Container aws-node failed liveness probe, will be restarted
Warning Unhealthy 34m kubelet Readiness probe failed: {"level":"info","ts":"2021-04-20T14:54:39.438Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}
Warning Unhealthy 33m kubelet Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: read init-p: connection reset by peer: unknown
Warning Unhealthy 33m kubelet Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:370: starting container process caused: process_linux.go:103: executing setns process caused: exit status 1: unknown
Normal Pulling 33m (x2 over 36m) kubelet Pulling image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10"
Normal Pulled 33m kubelet Successfully pulled image "602401143452.dkr.ecr.eu-west-1.amazonaws.com/amazon-k8s-cni:v1.7.10" in 619.512059ms
Normal Created 33m (x2 over 36m) kubelet Created container aws-node
Normal Started 33m (x2 over 36m) kubelet Started container aws-node
Warning Unhealthy 33m kubelet Readiness probe errored: rpc error: code = Unknown desc = container not running (a79647cd4b740690c9fb3211b40f9f183e3b4faf8e3efdcd7302eebdbb699b6c)
Warning Unhealthy 2m6s (x77 over 33m) kubelet (combined from similar events): Liveness probe failed: {"level":"info","ts":"2021-04-20T15:26:42.360Z","caller":"/usr/local/go/src/runtime/proc.go:203","msg":"timeout: failed to connect service ":50051" within 10s"}

EKS version: 1.19

Node version: v1.19.6-eks-49a6c0

Instance type: t3a.large

kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.10
amazon-k8s-cni:v1.7.10

kubectl describe daemonset aws-node -n kube-system | grep -i "ENABLE_POD_ENI"
ENABLE_POD_ENI: false

As a workaround I tried to increase the connect timeout for the liveness and readiness probe exec command, but it still doesn't work.

Liveness:   exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=60s timeout=10s period=10s #success=1 #failure=3
Readiness:  exec [/app/grpc-health-probe -addr=:50051 -connect-timeout=10s] delay=15s timeout=10s period=10s #success=1 #failure=3

The issue is not related to #1360, as I checked the loopback interface on a node with a failing aws-node:
ifconfig -a

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 133982  bytes 9506349 (9.0 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 133982  bytes 9506349 (9.0 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

Kindly let me know if you need more information.

@vinothav

This is what the ipamd log states:

cat ipamd.log | grep error

{"level":"error","ts":"2021-04-20T14:17:02.797Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T14:32:17.001Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:931","msg":"Failed to retrieve interfaces data from instance metadata RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-20T16:32:31.709Z","caller":"ipamd/ipamd.go:494","msg":"IP pool reconcile: Failed to get attached ENI info: get attached ENIs: failed to retrieve interfaces data: RequestError: send request failed\ncaused by: Get http://169.254.169.254/latest/meta-data/network/interfaces/macs/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2021-04-21T03:07:53.689Z","caller":"awsutils/awsutils.go:416","msg":"Failed to retrieve vpc-ipv4-cidr-blocks from instance metadata service"}

FYI: these are self-managed nodes.

@anwarchk

anwarchk commented Jun 5, 2021

/reopen

@alaa

alaa commented Nov 3, 2021

Having the same issue with amazon-k8s-cni:v1.9.0 and amazon-k8s-cni:v1.9.3

@jayanthvn
Contributor

@alaa - Which k8s version are you on? There is a known issue with K8s 1.21 - Ref: #1425 (comment)

@alaa

alaa commented Nov 4, 2021

@jayanthvn Oh, I am using EKS 1.21

@muruguru

@jayanthvn I am using EKS 1.21 and tried both v1.9.3 and v1.10.1 of CNI and see the same issue as well.
amazon-k8s-cni-init:v1.9.3-eksbuild.1

@crechandan

crechandan commented Dec 8, 2021

Any update on this issue? I am also facing a similar issue:

  • EKS version: 1.20.7
  • AWS CNI versions: amazon-k8s-cni-init:v1.7.5-eksbuild.1, amazon-k8s-cni:v1.7.5-eksbuild.1

I am deploying the EKS add-on using a CloudFormation template; below is the snippet:

EksClusterAddon:
  Type: AWS::EKS::Addon
  Properties:
    AddonName: !Ref EksAddonName
    ClusterName: !Ref EksClusterName
    ResolveConflicts: OVERWRITE
    ServiceAccountRoleArn: !Ref EksServiceRoleArn

I am not providing any AddonVersion, so the version above is picked up by AWS by default.
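
As an aside, the add-on versions available for a given cluster version can be listed with the AWS CLI, so an explicit AddonVersion property could be pinned in the template instead; the Kubernetes version below matches the one quoted above:

# Sketch only: list vpc-cni add-on versions compatible with Kubernetes 1.20.
aws eks describe-addon-versions --addon-name vpc-cni --kubernetes-version 1.20 \
  --query 'addons[].addonVersions[].addonVersion' --output text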

Logs for the aws-node pod:

{"level":"info","ts":"2021-12-08T13:01:00.915Z","caller":"entrypoint.sh","msg":"Install CNI binary.."}
{"level":"info","ts":"2021-12-08T13:01:00.940Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2021-12-08T13:01:00.948Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
ERROR: logging before flag.Parse: E1208 13:01:01.348599 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ERROR: logging before flag.Parse: E1208 13:02:01.442917 9 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request


@yongzhang

Same issue, posted a comment here: #1425 (comment)

@s7an-it

s7an-it commented Feb 4, 2022

@crechandan, I am on the same setup facing the same issues, did you manage to find a resolution for it?

@crechandan


@s7an-it No, not as of now - I am unable to find any resolution for this setup (AWS CloudFormation EKS add-on). As a workaround, I am installing the AWS CNI with a simple kubectl command, called from an Ansible playbook running on an EC2 instance that has access to the EKS cluster.

Follow this page: https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html#updating-vpc-cni-eks-add-on

This is why I wanted to manage the cluster add-on via CloudFormation (to get rid of this workaround).

@berry2012

I am able to reproduce a similar issue.

kubectl describe pod aws-node-sgp4c -n kube-system

Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:07.661Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:14.960Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:24.957Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  10m                   kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:34.957Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m52s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:44.970Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m42s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:42:54.965Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m32s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:43:04.966Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m22s                 kubelet            Liveness probe failed: {"level":"info","ts":"2022-03-17T12:43:14.983Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}
  Warning  Unhealthy  9m21s                 kubelet            Readiness probe failed: {"level":"info","ts":"2022-03-17T12:43:15.045Z","caller":"/usr/local/go/src/runtime/proc.go:225","msg":"timeout: failed to connect service \":50051\" within 5s"}

kubectl logs aws-node-sgp4c -n kube-system

{"level":"info","ts":"2022-03-17T12:58:06.329Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-03-17T12:58:06.330Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-03-17T12:58:06.350Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-03-17T12:58:06.352Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-03-17T12:58:08.362Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:10.370Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:12.378Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:14.387Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:16.397Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-03-17T12:58:18.406Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

Debugging from the node:
tail -f /var/log/aws-routed-eni/ipamd.log | grep error

{"level":"error","ts":"2022-03-17T13:06:14.594Z","caller":"ipamd/ipamd.go:456","msg":"Failed to call ec2:DescribeNetworkInterfaces for [eni-0b8d9XXXXX0ff09 eni-0681XXXXX2427]: WebIdentityErr: failed to retrieve credentials\ncaused by: InvalidIdentityToken: No OpenIDConnect provider found in your account for https://oidc.eks.eu-west-1.amazonaws.com/id/97043D9XXXXXXXXC65E8E946\n\tstatus code: 400, request id: d71dd6e4-a65f-4247-b21b-625d6d1d2787"}

Solution:

aws iam list-open-id-connect-providers | grep 97043D9XXXXXXXXC65E8E946
If output is returned from the previous command, then you already have a provider for your cluster. If no output is returned, then you must create an IAM OIDC provider.
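
As an aside, the issuer ID to grep for can be read from the cluster itself; EKS-cluster-name is the same placeholder used in the command below:

# Sketch only: print the cluster's OIDC issuer URL; the trailing ID is what to grep for.
aws eks describe-cluster --name EKS-cluster-name \
  --query 'cluster.identity.oidc.issuer' --output text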

eksctl utils associate-iam-oidc-provider --cluster EKS-cluster-name --approve

Delete the pods so they are recreated by the DaemonSet:

kubectl delete pods -n kube-system -l k8s-app=aws-node

Confirm they are running:

kubectl get pods -n kube-system  -l k8s-app=aws-node
NAME             READY   STATUS    RESTARTS   AGE
aws-node-cnpq8   1/1     Running   0          19s

@rafaeldasilva

If you are using an STS VPC endpoint, check the variables AWS_STS_REGIONAL_ENDPOINTS, AWS_WEB_IDENTITY_TOKEN_FILE, and AWS_ROLE_ARN:

kubectl describe pod -n kube-system aws-node-xxxx | grep 'AWS_ROLE_ARN:\|AWS_WEB_IDENTITY_TOKEN_FILE:\|AWS_STS_REGIONAL_ENDPOINTS'

From: https://docs.aws.amazon.com/eks/latest/userguide/cni-iam-role.html

Also check the subnets attached to the VPC endpoint.
