
IPAMD fails to start #1847

Closed
grumpymatt opened this issue Feb 4, 2022 · 62 comments

@grumpymatt

grumpymatt commented Feb 4, 2022

What happened:
IPAMD fails to start with an iptables error. The aws-node pods fail to start and prevent worker nodes from going Ready.
This started occurring after updating to Rocky Linux 8.5, which is based on RHEL 8.5.

/var/log/aws-routed-eni/ipamd.log

{"level":"error","ts":"2022-02-04T14:38:08.239Z","caller":"networkutils/network.go:385","msg":"ipt.NewChain error for chain [AWS-SNAT-CHAIN-0]: running [/usr/sbin/iptables -t nat -N AWS-SNAT-CHAIN-0 --wait]: exit status 3: iptables v1.8.4 (legacy): can't initialize iptables table `nat': Table does not exist (do you need to insmod?)\nPerhaps iptables or your kernel needs to be upgraded.\n"}

POD logs
kubectl logs -n kube-system aws-node-9tqb6

{"level":"info","ts":"2022-02-04T15:11:48.035Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-02-04T15:11:48.036Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-02-04T15:11:48.062Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-02-04T15:11:48.071Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-02-04T15:11:50.092Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:52.103Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:54.115Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-02-04T15:11:56.124Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

Attach logs

What you expected to happen:
Expect ipamd to start normally.

How to reproduce it (as minimally and precisely as possible):
Deploy an EKS cluster with an AMI based on Rocky Linux 8.5. In theory, any RHEL 8.5-based AMI could have this problem.

Anything else we need to know?:
Running the iptables command from the ipamd log as root on the worker node works fine.

Environment:

  • Server Version: version.Info{Major:"1", Minor:"19+", GitVersion:"v1.19.15-eks-9c63c4", GitCommit:"9c63c4037a56f9cad887ee76d55142abd4155179", GitTreeState:"clean", BuildDate:"2021-10-20T00:21:03Z", GoVersion:"go1.15.15", Compiler:"gc", Platform:"linux/amd64"}
  • CNI: 1.10.1
  • OS (e.g: cat /etc/os-release):
    NAME="Rocky Linux"
    VERSION="8.5 (Green Obsidian)"
    ID="rocky"
    ID_LIKE="rhel centos fedora"
    VERSION_ID="8.5"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Rocky Linux 8.5 (Green Obsidian)"
    ANSI_COLOR="0;32"
    CPE_NAME="cpe:/o:rocky:rocky:8:GA"
    HOME_URL="https://rockylinux.org/"
    BUG_REPORT_URL="https://bugs.rockylinux.org/"
    ROCKY_SUPPORT_PRODUCT="Rocky Linux"
    ROCKY_SUPPORT_PRODUCT_VERSION="8"
  • Kernel (e.g. uname -a): Linux ip-10-2--xx-xxx.ec2.xxxxxxxx.com 4.18.0-348.12.2.el8_5.x86_64 #1 SMP Wed Jan 19 17:53:40 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
@grumpymatt grumpymatt added the bug label Feb 4, 2022
@grumpymatt
Author

We found that loading the ip_tables, iptable_nat, and iptable_mangle kernel modules fixes the issue: modprobe ip_tables iptable_nat iptable_mangle

Still trying to figure out why these modules were loaded by default in 8.4 and not in 8.5.
Also still not sure why the same iptables commands work directly on the worker instance without these modules, but not in the container.
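
If it helps anyone else, this is roughly how we load the modules now and keep them loaded across reboots (a minimal sketch; the file name under /etc/modules-load.d/ is our own choice):

# Load the missing netfilter modules immediately
modprobe ip_tables iptable_nat iptable_mangle

# Persist them across reboots via systemd-modules-load (file name is arbitrary)
cat <<'EOF' > /etc/modules-load.d/iptables.conf
ip_tables
iptable_nat
iptable_mangle
EOF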

@achevuru
Contributor

achevuru commented Feb 9, 2022

We do install iptables by default in the aws-node container images. It would be good to check the changelog between 8.4 and 8.5 for any insight into the observed behavior.

@vishal0nhce

@grumpymatt I have been getting the same issue while setting up EKS on RHEL 8.5, and after loading the kernel modules it does work fine. The strange thing is that I tried the same on RHEL 8.0 worker nodes and still got the same issue. It works fine on RHEL 7.x, though.

@achevuru achevuru self-assigned this Feb 15, 2022
@achevuru
Contributor

achevuru commented Feb 15, 2022

@grumpymatt Since the issue is clearly tied to the missing iptables modules, I think we can close this issue. Let us know if there is any other concern.

@vishal0nhce Yeah, the iptables module is required for VPC CNI, and I'm not sure why it is missing in RHEL 8.5. I don't see any specific call-out for RHEL 8.5 around this.

@grumpymatt
Author

We found an alternative way of fixing it by updating iptables inside the CNI container image.

FROM 602401143452.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1
RUN yum install iptables-nft -y
RUN cd /usr/sbin && rm iptables && ln -s xtables-nft-multi iptables

My concern is that the direction of RHEL and downstream distros seems to be away from iptables-legacy and toward iptables-nft. Are there any plans to address this in the CNI container image?
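
For reference, this is roughly how we build and deploy the patched image (a sketch; the registry URL and tag are placeholders for whatever repository you push to):

# Build and push the patched image (Dockerfile above), then point aws-node at it
docker build -t <account>.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1-nft .
docker push <account>.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1-nft
kubectl -n kube-system set image daemonset/aws-node \
  aws-node=<account>.dkr.ecr.us-east-1.amazonaws.com/amazon-k8s-cni:v1.10.1-nft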

@achevuru
Contributor

Interesting. So RHEL 8 doesn't support iptables-legacy anymore? That explains the issue. I think iptables legacy mode is still sort of the default (at least for now) for most distributions, and in particular Amazon Linux 2 images use iptables-legacy by default as well. We track AL2 images for our default CNI builds. We will check and update if there is something we can do to address this scenario.

@bilby91

bilby91 commented Apr 4, 2022

We are seeing a similar situation where IPAM-D won't start successfully and the aws-node pod restarts at least once. We are running EKS 1.20.

$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

@jayanthvn
Contributor

@bilby91 - Can you please check if kube-proxy is taking time to start? Kube-proxy should set up the rules for aws-node to reach the API server on startup.
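
A quick way to check that (a sketch; the label selector assumes the default kube-proxy DaemonSet labels):

# Are the kube-proxy pods running, and do their logs show errors?
kubectl get pods -n kube-system -l k8s-app=kube-proxy -o wide
kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=50
# On the node: kube-proxy should have written NAT rules for the kubernetes Service
sudo iptables-save -t nat | grep KUBE-SERVICES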

@dhavaln-able

I'm seeing a similar error, with IPAMD failing to start on the latest version, v1.11.0. Kube-proxy is already running successfully. The only change was a VPC CNI image update from 1.9.0 to 1.11.0. Any clue what's wrong with the latest version? TIA

{"level":"info","ts":"2022-04-21T19:44:43.569Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

@kathy-lee

Same here: a similar error, with IPAMD failing to start on the latest version, v1.11.0. Kube-proxy is already running successfully.
{"level":"info","ts":"2022-04-27T10:07:56.670Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

@js-timbirkett

I was seeing this error. In my case, a developer had manually created VPC endpoints for a few services, including STS, resulting in traffic to those services being blackholed. So ipamd could not create a session to collect the information it needed.
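
If you suspect the same thing, listing the endpoints in the cluster VPC and testing STS from a worker node is a quick sanity check (a sketch; the VPC ID and region are placeholders):

# Which VPC endpoints exist, and are they available?
aws ec2 describe-vpc-endpoints \
  --filters Name=vpc-id,Values=vpc-0123456789abcdef0 \
  --query 'VpcEndpoints[].[ServiceName,State]' --output table
# From a worker node: the request should complete if STS is reachable
curl -sS --max-time 5 https://sts.us-east-1.amazonaws.com >/dev/null && echo "STS reachable"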

@sahil100122

I am also facing the same issue while trying to upgrade the cluster from 1.19 to 1.20 in EKS. I can't pinpoint the exact problem.

@jayanthvn
Contributor

@dhavaln-able and @kathy-lee - So with v1.11.0, is aws-node continuously crashing, or does it come up after a few restarts?

@sahil100122 - You mean that while upgrading, kube-proxy is up and running but ipamd is not starting at all?

@smalltown

smalltown commented May 8, 2022

I found that FlatCar CoreOS also encounters a related issue: the iptables command in FlatCar CoreOS version 3033.2.0 uses the nftables kernel backend instead of the legacy iptables backend, which means pods on a secondary ENI cannot reach the Kubernetes internal ClusterIP.

Thanks to @grumpymatt's workaround: after following the same approach to build a customized amazon-k8s-cni container image, the AWS VPC CNI now works on FlatCar CoreOS versions greater than 3033.2.0.

@rhenry-brex

Had the same issue while upgrading, but after looking at the troubleshooting guide and patching the daemonset with the following, aws-node came up as expected and without issues.

# New env vars introduced with 1.10.x
- op: add
  path: "/spec/template/spec/initContainers/0/env/-"
  value: {"name": "ENABLE_IPv6", "value": "false"}
- op: add
  path: "/spec/template/spec/containers/0/env/-"
  value: {"name": "ENABLE_IPv4", "value": "true"}
- op: add
  path: "/spec/template/spec/containers/0/env/-"
  value: {"name": "ENABLE_IPv6", "value": "false"}
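
In case it helps anyone, this is roughly how the ops above can be applied (a sketch; assumes they are saved as aws-node-env-patch.yaml - kubectl converts YAML patch files to JSON, and older kubectl versions may need -p "$(cat aws-node-env-patch.yaml)" instead of --patch-file):

kubectl patch daemonset aws-node -n kube-system \
  --type=json \
  --patch-file=aws-node-env-patch.yaml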

@varunpalekar

I also faced the above issue, but in my case I was using a custom kube-proxy image.
When I reverted to the default kube-proxy image and restarted the aws-node pods, everything worked fine.

Why does aws-node's ipamd not give any error related to communication if the issue is with kube-proxy? 🤔

@trallnag

trallnag commented Jul 19, 2022

Had a similar issue yesterday. AWS Systems Manager applied a patch to all of our nodes, which required a reboot of the instances. All instances came up healthy, but on three out of five the network was not working, basically making the cluster unusable. Investigation led me to issues like this one and an AWS Knowledge Center entry.

Recycling all nodes resolved the issue. I did not try just terminating the aws-node pods. Interestingly, only one out of three clusters was affected, so it is probably difficult to reproduce.

What I also noticed: Why is aws-node mounting /var/run/dockershim.sock even though we use containerd?

  • AWS Node Image: 602401143452.dkr.ecr.eu-central-1.amazonaws.com/amazon-k8s-cni:v1.10.1-eksbuild.1
  • Default kube-proxy, default aws-node, etc.

@inge4pres

Hey all 👋🏼 please be aware that this failure mode also happens when the IPs in a subnet are exhausted.

I just faced this and noticed I had misconfigured my worker groups to use a small subnet (/26) instead of the bigger one I intended to use (/18).
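
An easy way to spot this (a sketch; the subnet IDs are placeholders):

# AvailableIpAddressCount near zero means the subnet is exhausted
aws ec2 describe-subnets \
  --subnet-ids subnet-0123456789abcdef0 subnet-0fedcba9876543210 \
  --query 'Subnets[].[SubnetId,CidrBlock,AvailableIpAddressCount]' \
  --output table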

@TaiSHiNet

Also: Check you have the right security group attached to your nodes

@esidate

esidate commented Sep 9, 2022

For those coming here after upgrading EKS, try re-applying the VPC CNI manifest file, for example:
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml

@jacobhjkim

For me, the issue was the policy AmazonEKS_CNI_Policy-2022092909143815010000000b.
My policy only allowed IPv6, like below.

{
    "Statement": [
        {
            "Action": [
                "ec2:DescribeTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:AssignIpv6Addresses"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "IPV6"
        },
        {
            "Action": "ec2:CreateTags",
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Sid": "CreateTags"
        }
    ],
    "Version": "2012-10-17"
}

I changed the policy like below:

{
    "Statement": [
        {
            "Action": [
                "ec2:UnassignPrivateIpAddresses",
                "ec2:ModifyNetworkInterfaceAttribute",
                "ec2:DetachNetworkInterface",
                "ec2:DescribeTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:DeleteNetworkInterface",
                "ec2:CreateNetworkInterface",
                "ec2:AttachNetworkInterface",
                "ec2:AssignPrivateIpAddresses"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "IPV4"
        },
        {
            "Action": [
                "ec2:DescribeTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceTypes",
                "ec2:AssignIpv6Addresses"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Sid": "IPV6"
        },
        {
            "Action": "ec2:CreateTags",
            "Effect": "Allow",
            "Resource": "arn:aws:ec2:*:*:network-interface/*",
            "Sid": "CreateTags"
        }
    ],
    "Version": "2012-10-17"
}

and it works! 😅
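
For anyone doing the same, this is roughly how the updated document can be published as the new default policy version (a sketch; the policy ARN and file name are placeholders, and IAM keeps at most five versions, so an old version may need to be deleted first):

aws iam create-policy-version \
  --policy-arn arn:aws:iam::123456789012:policy/AmazonEKS_CNI_Policy-2022092909143815010000000b \
  --policy-document file://cni-policy.json \
  --set-as-default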

@zhengyongtao

I've had the same problem for the past two weeks; has anyone found a solution?

@jayanthvn
Contributor

I've had the same problem for the past two weeks; has anyone found a solution?

Can you please share the last few lines of the ipamd logs before aws-node restarts?

@zhengyongtao

zhengyongtao commented Oct 3, 2022

I've had the same problem for the past two weeks; has anyone found a solution?

Can you please share the last few lines of the ipamd logs before aws-node restarts?

ipamd log:

{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.43.0"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.43.0/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.60.1"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.60.1/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.47.2"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.47.2/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.46.131"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.46.131/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.61.196"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.61.196/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.49.6"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.49.6/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.41.135"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.41.135/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.38.218"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.38.218/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.39.157"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.39.157/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1418","msg":"Trying to add 192.168.59.213"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"Adding 192.168.59.213/32 to DS for eni-00023922abf62516c"}
{"level":"info","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1542","msg":"IP already in DS"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Reconcile existing ENI eni-00023922abf62516c IP prefixes"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1351","msg":"Found prefix pool count 0 for eni eni-00023922abf62516c\n"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:653","msg":"Successfully Reconciled ENI/IP pool"}
{"level":"debug","ts":"2022-10-03T15:42:25.909Z","caller":"ipamd/ipamd.go:1396","msg":"IP pool stats: Total IPs/Prefixes = 87/0, AssignedIPs/CooldownIPs: 31/0, c.maxIPsPerENI = 29"}
command terminated with exit code 137

aws-node:

# kubectl logs -f aws-node-zdp6x --tail 30 -n kube-system  
{"level":"info","ts":"2022-10-02T14:56:07.820Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2022-10-02T14:56:07.821Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2022-10-02T14:56:07.833Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2022-10-02T14:56:07.834Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2022-10-02T14:56:09.841Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:11.847Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:13.853Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:15.860Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:17.866Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:19.872Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:21.878Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:23.884Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:25.890Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:27.897Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:29.903Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:31.909Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:33.916Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:35.922Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:37.928Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:39.934Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:41.940Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:43.947Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:45.953Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:47.959Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2022-10-02T14:56:49.966Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}

Event screenshots: (screenshot omitted)

I use cluster-autoscaler for auto-scaling, and the Kubernetes version is 1.22. I also followed the troubleshooting guide https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues and applied the suggestion
kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.4/config/master/aws-k8s-cni.yaml

Interestingly, this failure usually only occurs on a certain node, and when I terminate that node's instance and let it scale up again automatically, it starts working.

But after running for a while, it restarts again.
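
One more thing I still need to check: exit code 137 is 128 + 9, i.e. the container was SIGKILLed, which usually points at the OOM killer or a failed liveness probe, so the pod's last state and events should narrow it down (a sketch; the pod name is from the logs above):

kubectl describe pod aws-node-zdp6x -n kube-system | grep -A 8 'Last State'
kubectl get events -n kube-system --field-selector involvedObject.name=aws-node-zdp6x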

@jayanthvn
Contributor

@koolwithk helm chart 1.2.0 has CNI image version 1.12.0. Do you mean that with 1.12.0 ipamd is failing to start? Can you please share ipamd.log? Also, are you using the EKS AMI?

@gaganyaan2

gaganyaan2 commented Nov 15, 2022

@jayanthvn Sorry, I should have stated clearly earlier that in my automation script the image is passed to helm as a hardcoded argument using --set; that's why the automation was picking up the new helm chart 1.2.0 with the old image amazon-k8s-cni:v1.11.3-eksbuild.1.

  • CNI image v1.12.0 was not failing; it was actually the amazon-k8s-cni:v1.11.3-eksbuild.1 image, deployed using the 1.2.0 helm chart.
  • Yes, it's a custom AMI built using amazon-eks-ami.

After pinning the helm chart version to v1.1.21 and the image to amazon-k8s-cni:v1.11.3-eksbuild.1, it worked.
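
For anyone else pinning versions, the helm invocation looks roughly like this (a sketch; the release name is my own, and the image.tag / init.image.tag value keys are assumptions - check the chart's values.yaml for the exact keys):

helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-vpc-cni eks/aws-vpc-cni \
  --namespace kube-system \
  --version 1.1.21 \
  --set image.tag=v1.11.3-eksbuild.1 \
  --set init.image.tag=v1.11.3-eksbuild.1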

@jayanthvn
Contributor

No problem, thanks for confirming. Yes, helm chart 1.2.0 with CNI build 1.11.3 will have this issue since the dockershim socket was removed - #2122

@jdn5126 jdn5126 assigned jdn5126 and unassigned achevuru Nov 22, 2022
@trallnag

trallnag commented Dec 6, 2022

Just encountered this error again, but only with a single node. On all other nodes it worked. I manually deleted the node and the new one did not have the error anymore. Kubernetes v1.22 and VPC CNI v1.11.4 using the official add-on.

@jayanthvn
Contributor

@trallnag - Did you get a chance to collect ipamd.log on the impacted node? If so, did it show any error?

@trallnag

trallnag commented Dec 6, 2022

@jayanthvn, nope, I missed doing that. I'll report back should the issue reoccur. I'm planning to upgrade to v1.12 tomorrow, so maybe I will recycle a few nodes before performing the upgrade to reproduce the issue.

@ghost

ghost commented Jan 14, 2023

It happened in my case because the aws-node daemonset was missing the permissions to manage the IP addresses of nodes and pods. The daemonset uses the K8s service account named aws-node. Solved it by creating an IAM role with AmazonEKS_CNI_Policy and attaching the role to the service account. To attach the role, add an annotation to the service account named aws-node and restart the daemonset.

annotations:
  eks.amazonaws.com/role-arn: your-role-arn

As mentioned in some answers, it's not a good security practice to attach the AmazonEKS_CNI_Policy to the nodes directly; refer to https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/#use-separate-iam-role-for-cni for more details.
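
Roughly, the annotation and restart look like this (a sketch; the role ARN is a placeholder):

kubectl annotate serviceaccount aws-node -n kube-system \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/AmazonEKS_CNI_Role --overwrite
kubectl rollout restart daemonset aws-node -n kube-system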

@nehatomar12

@muhd-aslm in my case all aws-node-* pods are running

{"level":"info","ts":"2023-01-19T13:46:22.892Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2023-01-19T13:46:22.893Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2023-01-19T13:46:22.921Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2023-01-19T13:46:22.923Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
I0119 13:46:24.483956      12 request.go:655] Throttling request took 1.046230814s, request: GET:https://10.21.0.1:443/apis/storage.k8s.io/v1beta1?timeout=32s
{"level":"info","ts":"2023-01-19T13:46:24.933Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2023-01-19T13:46:26.943Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
W0119 13:46:27.804083      12 warnings.go:70] spec.configSource: deprecated in v1.22, support removal is planned in v1.23
{"level":"info","ts":"2023-01-19T13:46:28.954Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2023-01-19T13:46:28.983Z","caller":"entrypoint.sh","msg":"Copying config file ... "}
{"level":"info","ts":"2023-01-19T13:46:28.993Z","caller":"entrypoint.sh","msg":"Successfully copied CNI plugin binary and config file."}
{"level":"info","ts":"2023-01-19T13:46:28.994Z","caller":"entrypoint.sh","msg":"Foregrounding IPAM daemon ..."}

but ping is not working

❯ kubectl exec -i -t dnsutils -- ping google.com
^[[Oping: unknown host google.com
command terminated with exit code 2

@ghost

ghost commented Jan 25, 2023

@nehatomar12 can you check this document to see whether you have the same issue:
https://yashmehrotra.com/posts/the-case-of-the-missing-packet-an-eks-migration-tale/

@presidenten

For those coming here after upgrading EKS try re-applying the VPC CNI manifest file, for example: kubectl apply -f https://raw.githubusercontent.com/aws/amazon-vpc-cni-k8s/v1.11.3/config/master/aws-k8s-cni.yaml

@esidate Thanks! This fixed it for me as well.

@fpmanuel

fpmanuel commented May 5, 2023

@ermiaqasemi From this tutorial, I chose to attach the AmazonEKS_CNI_Policy to the aws-node service account, and I was getting the error.

I decided to try simply attaching it to the AmazonEKSNodeRole, which is apparently the less recommended way to do it, but it works.

This was the solution for me!

@joejulian

I'm also having the problem where ipamd fails to connect, but I have a different (and reliable) way of reproducing it.

EKS 1.23
Worker AMI Id: ami-055e3d14d238cbddd (ubuntu-eks/k8s_1.23/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-20230430)

All I have to do is reboot a worker. aws-node fails to connect every time.

@jdn5126
Contributor

jdn5126 commented May 12, 2023

@joejulian this sounds like a different issue than the originally reported problem. Can you please open a new issue and include more information? What VPC CNI version? What does /var/log/aws-routed-eni/ipamd.log show? I see that you are using Ubuntu, so you may want to look at some of the known issues in our troubleshooting doc: https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md#known-issues

@blu3r4y

blu3r4y commented May 18, 2023

I experienced the same issue. In our cluster, the kube-proxy version was too old.
After updating it to a version that is compatible with the cluster version, the nodes started up fine.
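
For the managed add-on, checking the recommended versions and bumping kube-proxy is roughly (a sketch; the cluster name, Kubernetes version, and add-on version are placeholders):

aws eks describe-addon-versions --addon-name kube-proxy --kubernetes-version 1.24 \
  --query 'addons[].addonVersions[].addonVersion'
aws eks update-addon --cluster-name my-cluster --addon-name kube-proxy \
  --addon-version v1.24.17-eksbuild.2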

@cilindrox

Just a quick heads up that I noticed the same error as reported here, but it was #2393 - disabling prom-adapter got vpc-cni back on track.

@endersonmaia

I get a similar error when updating AWS EKS from 1.24 to 1.25 via AWS CDK.

{"level":"info","ts":"2023-07-13T13:50:57.779Z","caller":"entrypoint.sh","msg":"Validating env variables ..."}
{"level":"info","ts":"2023-07-13T13:50:57.780Z","caller":"entrypoint.sh","msg":"Install CNI binaries.."}
{"level":"info","ts":"2023-07-13T13:50:57.800Z","caller":"entrypoint.sh","msg":"Starting IPAM daemon in the background ... "}
{"level":"info","ts":"2023-07-13T13:50:57.802Z","caller":"entrypoint.sh","msg":"Checking for IPAM connectivity ... "}
{"level":"info","ts":"2023-07-13T13:50:59.817Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2023-07-13T13:51:01.827Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2023-07-13T13:51:03.838Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2023-07-13T13:51:05.851Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2023-07-13T13:51:07.862Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
{"level":"info","ts":"2023-07-13T13:51:09.871Z","caller":"entrypoint.sh","msg":"Retrying waiting for IPAM-D"}
....

Using this version: amazon-k8s-cni-init:v1.10.1

@endersonmaia

Updating the kube-proxy to a more recent version solved the problem.

@dwoods

dwoods commented Aug 15, 2023

I let kube-proxy get behind, and updating it solved the issue for me as well.

@DrackThor

I experienced the same issue. In our cluster, the kube-proxy version was too old. After updating it to a version that is compatible with the cluster version, the nodes started up fine.

This also fixed our problem - thanks a million for this hint!

@arianitu

arianitu commented Oct 6, 2023

I experienced the same issue. In our cluster, the kube-proxy version was too old. After updating it to a version that is compatible with the cluster version, the nodes started up fine.

Same here; this came from upgrading from Kubernetes 1.21 -> 1.25 under AWS EKS, where aws-node failed to start with these logs:

kubectl logs of aws-node did not reveal much:

time="2023-10-06T20:07:03Z" level=info msg="Starting IPAM daemon... "
time="2023-10-06T20:07:03Z" level=info msg="Checking for IPAM connectivity... "

I had to log in to the aws-node pod manually (following https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/troubleshooting.md):

kubectl exec -it aws-node-vd5r8 -n kube-system -c aws-eks-nodeagent /bin/bash

Find the log file under /var/log/aws-routed-eni; it should be a file called ipamd*.log

{"level":"error","ts":"2023-10-06T20:27:43.658Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"https://10.100.0.1:443/version?timeout=5s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2023-10-06T20:27:49.658Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"https://10.100.0.1:443/version?timeout=5s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2023-10-06T20:27:55.657Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"https://10.100.0.1:443/version?timeout=5s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}
{"level":"error","ts":"2023-10-06T20:28:01.658Z","caller":"wait/wait.go:109","msg":"Unable to reach API Server, Get \"https://10.100.0.1:443/version?timeout=5s\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}

The only thing I had not upgraded in the cluster was kube-proxy, so I assumed that was the issue, and it was. Glad someone else had the same experience.

Make sure to go through each of these. I did not have add-ons for anything, so I had to go through the self-managed route, which sucks. I hope there is maybe a way to go from self-managed to add-ons?

@jdn5126 jdn5126 removed their assignment Oct 13, 2023
@alant94

alant94 commented Oct 24, 2023

I hope there is maybe a way to go from self managed to addons?

@arianitu hi, FYI: there is a way. It is described in this documentation page, for example. I just did it on my clusters, which are deployed with Terraform, and now we can manage add-ons more easily going forward.

This is the code I added to Terraform to migrate the self-managed add-ons to EKS add-ons. I had default configurations in these add-ons, but creation was still failing without OVERWRITE, so I added it and then it completed successfully.

# https://docs.aws.amazon.com/eks/latest/userguide/managing-vpc-cni.html
resource "aws_eks_addon" "vpc_cni" {
  cluster_name                = aws_eks_cluster.cluster.name
  addon_name                  = "vpc-cni"
  addon_version               = "v1.15.1-eksbuild.1"
  resolve_conflicts_on_create = "OVERWRITE"
}

# https://docs.aws.amazon.com/eks/latest/userguide/managing-coredns.html
resource "aws_eks_addon" "coredns" {
  cluster_name                = aws_eks_cluster.cluster.name
  addon_name                  = "coredns"
  addon_version               = "v1.9.3-eksbuild.7"
  resolve_conflicts_on_create = "OVERWRITE"
}

# https://docs.aws.amazon.com/eks/latest/userguide/managing-kube-proxy.html
resource "aws_eks_addon" "kube_proxy" {
  cluster_name                = aws_eks_cluster.cluster.name
  addon_name                  = "kube-proxy"
  addon_version               = "v1.24.17-eksbuild.2"
  resolve_conflicts_on_create = "OVERWRITE"

  depends_on = [aws_eks_node_group.group]
}

@wenfengwang

It happened in my case because the aws-node daemonset was missing the permissions to manage the IP addresses of nodes and pods. The daemonset uses the K8s service account named aws-node. Solved it by creating an IAM role with AmazonEKS_CNI_Policy and attaching the role to the service account. To attach the role, add an annotation to the service account named aws-node and restart the daemonset.

annotations:
  eks.amazonaws.com/role-arn: your-role-arn

As mentioned in some answers, it's not a good security practise to attach the AmazonEKS_CNI_Policy to the nodes directly, refer https://aws.github.io/aws-eks-best-practices/networking/vpc-cni/#use-separate-iam-role-for-cni to know more.

It WORKS for me! Thanks a lot!

@jdn5126
Contributor

jdn5126 commented Jan 25, 2024

I'm going to close this issue, as I think its length and story are too tough to follow. It will still be searchable by others in the future, and we can rely on new issues to triage IPAMD errors during startup.

@jdn5126 jdn5126 closed this as completed Jan 25, 2024

This issue is now closed. Comments on closed issues are hard for our team to see.
If you need more assistance, please either tag a team member or open a new issue that references this one.
