P4de.24xlarge support #2425

lipovsek-aws · 2023-06-15T21:24:13Z

What happened:
I'm running an IPv6 EKS cluster with p4de.24xlarge EC2 instances. P4de instances are used for workloads such as large language models, which need large clusters, and IPv6 helps avoid IP address exhaustion. The aws-node pod fails on p4de.24xlarge instances, while instances from the C and M instance families run without any issues.

Attach logs
aws-node output:

Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
Installed /host/opt/cni/bin/aws-cni
time="2023-06-15T13:03:59Z" level=info msg="Starting IPAM daemon... "
Installed /host/opt/cni/bin/egress-v4-cni
time="2023-06-15T13:03:59Z" level=info msg="Checking for IPAM connectivity... "
time="2023-06-15T13:04:00Z" level=info msg="Copying config file... "
time="2023-06-15T13:04:00Z" level=info msg="Successfully copied CNI plugin binary and config file."
time="2023-06-15T13:04:00Z" level=error msg="Failed to wait for IPAM daemon to complete" error="exit status 1"

I then looked at the kubelet logs on the failing p4de.24xlarge node and found entries like:

plugin type=\\\"aws-cni\\\" name=\\\"aws-cni\\\" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused

The IPAMD log (/var/log/aws-routed-eni) on the failing p4de.24xlarge node then gives the real reason:

{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"ipamd/ipamd.go:2285","msg":"Check if instance supports Prefix Delegation"}
{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"awsutils/awsutils.go:1472","msg":"Instance hypervisor family unknown"}
{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"awsutils/awsutils.go:1472","msg":"Bare Metal Instance %!s(bool=false)"}
{"level":"error","ts":"2023-06-15T18:51:53.934Z","caller":"ipamd/ipamd.go:418","msg":"Prefix Delegation is not supported on non-nitro instance p4de.24xlarge. IPv6 is only supported in Prefix delegation Mode. "}
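When the IPAMD log is long, filtering it down to error-level entries makes the root cause easier to spot. The following is a minimal sketch, assuming each log line is a JSON object with the `level`, `ts`, `caller`, and `msg` fields seen in the excerpt above. The `logEntry` type and `errorEntries` function are hypothetical helpers written for this illustration, not part of the CNI.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors the fields visible in the IPAMD log excerpt.
// It is a hypothetical type for this sketch, not a CNI type.
type logEntry struct {
	Level  string `json:"level"`
	TS     string `json:"ts"`
	Caller string `json:"caller"`
	Msg    string `json:"msg"`
}

// errorEntries returns every line of logText that parses as JSON
// and has level "error". Lines that are not valid JSON are skipped.
func errorEntries(logText string) []logEntry {
	var out []logEntry
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		var e logEntry
		if err := json.Unmarshal(sc.Bytes(), &e); err != nil {
			continue
		}
		if e.Level == "error" {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	// Two lines taken from the log excerpt in this issue.
	sample := `{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"awsutils/awsutils.go:1472","msg":"Instance hypervisor family unknown"}
{"level":"error","ts":"2023-06-15T18:51:53.934Z","caller":"ipamd/ipamd.go:418","msg":"Prefix Delegation is not supported on non-nitro instance p4de.24xlarge. IPv6 is only supported in Prefix delegation Mode. "}`

	for _, e := range errorEntries(sample) {
		fmt.Printf("%s %s: %s\n", e.TS, e.Caller, e.Msg)
	}
}
```

In practice the input would be read from the log file on the node rather than from an inline string.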

What you expected to happen:
P4de.24xlarge is a Nitro-based instance and should support prefix delegation mode for IPv6.

How to reproduce it (as minimally and precisely as possible):
Run an IPv6 EKS cluster with p4de.24xlarge nodes. The cluster can be created with, for example, Terraform or eksctl.

Anything else we need to know?:
Looking at the code, the CNI contains hard-coded instance-type values that cause this error further down the line.
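To illustrate how a hard-coded table can produce this failure, here is a minimal sketch. It is not the CNI's actual code: the `instanceLimits` type, the `hardCodedLimits` map, its entries, and `supportsPrefixDelegation` are all hypothetical names chosen for this example. The rule that Nitro or bare-metal instances qualify is an assumption inferred from the log messages above.

```go
package main

import "fmt"

// instanceLimits is a hypothetical stand-in for a per-instance-type
// record in a hard-coded table.
type instanceLimits struct {
	HypervisorType string // "nitro", "xen", or "" when unknown
	IsBareMetal    bool
}

// hardCodedLimits is illustrative only; the entries are not copied
// from the CNI source. "p4de.24xlarge" is deliberately recorded with
// an unknown hypervisor to mimic the failure reported in this issue.
var hardCodedLimits = map[string]instanceLimits{
	"c5.large":      {HypervisorType: "nitro"},
	"m5.xlarge":     {HypervisorType: "nitro"},
	"p4de.24xlarge": {HypervisorType: ""},
}

// supportsPrefixDelegation assumes that only Nitro or bare-metal
// instances qualify. An instance type with a missing or incomplete
// table entry is therefore rejected, even if the real hardware is
// Nitro-based.
func supportsPrefixDelegation(instanceType string) bool {
	limits, ok := hardCodedLimits[instanceType]
	if !ok || limits.HypervisorType == "" {
		fmt.Printf("hypervisor family unknown for %s\n", instanceType)
		return limits.IsBareMetal
	}
	return limits.HypervisorType == "nitro" || limits.IsBareMetal
}

func main() {
	for _, t := range []string{"c5.large", "p4de.24xlarge"} {
		fmt.Printf("%s supports prefix delegation: %v\n", t, supportsPrefixDelegation(t))
	}
}
```

Under these assumptions, correcting the table entry for the affected instance type is enough to make the check pass, with no change to the check itself.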

Environment:

Kubernetes version (use kubectl version): 1.27
CNI Version: v1.12.6-eksbuild.2
OS (e.g: cat /etc/os-release): See AMI amazon-eks-node-1.27-v20230607.
Kernel (e.g. uname -a): See AMI amazon-eks-node-1.27-v20230607.


jdn5126 · 2023-06-19T00:51:42Z

Closing, as this is fixed in the v1.13.2 release.

github-actions · 2023-06-19T00:52:03Z

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.

lipovsek-aws added the bug label Jun 15, 2023

jdn5126 mentioned this issue Jun 16, 2023

Fix hard-coded nitro instance types: p4de.24xlarge and c7g.metal #2428

Merged

jdn5126 closed this as completed Jun 19, 2023
