P4de.24xlarge support #2425

Closed
lipovsek-aws opened this issue Jun 15, 2023 · 2 comments

lipovsek-aws commented Jun 15, 2023

What happened:
I'm running an IPv6 EKS cluster with p4de.24xlarge EC2 instances. P4de instances are used for workloads such as large language models, which need large clusters, and IPv6 helps with IP address exhaustion. The aws-node pod fails on p4de.24xlarge instances, while instances from the C and M families run without any issues.

Attach logs
aws-node output:

Defaulted container "aws-node" out of: aws-node, aws-vpc-cni-init (init)
Installed /host/opt/cni/bin/aws-cni
time="2023-06-15T13:03:59Z" level=info msg="Starting IPAM daemon... "
Installed /host/opt/cni/bin/egress-v4-cni
time="2023-06-15T13:03:59Z" level=info msg="Checking for IPAM connectivity... "
time="2023-06-15T13:04:00Z" level=info msg="Copying config file... "
time="2023-06-15T13:04:00Z" level=info msg="Successfully copied CNI plugin binary and config file."
time="2023-06-15T13:04:00Z" level=error msg="Failed to wait for IPAM daemon to complete" error="exit status 1"

I then looked at the kubelet logs on a failing p4de.24xlarge node, where I found entries like:

plugin type=\\\"aws-cni\\\" name=\\\"aws-cni\\\" failed (add): add cmd: Error received from AddNetwork gRPC call: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing dial tcp 127.0.0.1:50051: connect: connection refused

The IPAMD log (/var/log/aws-routed-eni) on the failing p4de.24xlarge node then gives the real reason:

{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"ipamd/ipamd.go:2285","msg":"Check if instance supports Prefix Delegation"}
{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"awsutils/awsutils.go:1472","msg":"Instance hypervisor family unknown"}
{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"awsutils/awsutils.go:1472","msg":"Bare Metal Instance %!s(bool=false)"}
{"level":"error","ts":"2023-06-15T18:51:53.934Z","caller":"ipamd/ipamd.go:418","msg":"Prefix Delegation is not supported on non-nitro instance p4de.24xlarge. IPv6 is only supported in Prefix delegation Mode. "}
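For anyone triaging the same failure, here is a minimal sketch of pulling only the error-level entries out of this kind of JSON-lines log. It assumes one JSON object per line, as in the excerpt above. The function name `error_entries` and the `SAMPLE_LOG` constant are illustrative and not part of the CNI.

```python
import json

# Sample lines copied from the IPAMD log excerpt above.
SAMPLE_LOG = """\
{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"ipamd/ipamd.go:2285","msg":"Check if instance supports Prefix Delegation"}
{"level":"debug","ts":"2023-06-15T18:51:53.934Z","caller":"awsutils/awsutils.go:1472","msg":"Instance hypervisor family unknown"}
{"level":"error","ts":"2023-06-15T18:51:53.934Z","caller":"ipamd/ipamd.go:418","msg":"Prefix Delegation is not supported on non-nitro instance p4de.24xlarge. IPv6 is only supported in Prefix delegation Mode. "}
"""


def error_entries(log_text):
    """Return parsed entries whose level is "error", skipping non-JSON lines."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Tolerate stray non-JSON lines instead of aborting the scan.
            continue
        if isinstance(entry, dict) and entry.get("level") == "error":
            entries.append(entry)
    return entries


if __name__ == "__main__":
    for e in error_entries(SAMPLE_LOG):
        print(e["ts"], e["caller"], e["msg"])
```

To run it against a real node, read the log file from the directory mentioned above and pass its contents to `error_entries`.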

What you expected to happen:
p4de.24xlarge is a Nitro-based instance and should support prefix delegation mode for IPv6.

How to reproduce it (as minimally and precisely as possible):
Run an IPv6 EKS cluster with p4de.24xlarge nodes, for example one created with Terraform or eksctl.

Anything else we need to know?:
Looking at the code, we can see values hardcoded in the CNI that cause this error down the line.
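To illustrate how a hardcoded table could lead to the log sequence above, here is a rough sketch. It is not the CNI's actual Go code. The table contents and the names `INSTANCE_TABLE`, `hypervisor_family`, `supports_prefix_delegation`, and `check_ipv6_startup` are hypothetical. It assumes the failure comes from a missing or incomplete entry for the instance type, which the logs suggest but do not prove.

```python
# Hypothetical stand-in for a hardcoded per-instance table; not the CNI's real data.
INSTANCE_TABLE = {
    "c5.large": {"hypervisor": "nitro"},
    "m5.large": {"hypervisor": "nitro"},
    # "p4de.24xlarge" is deliberately absent to model a stale table.
}


def hypervisor_family(instance_type):
    """Look up the hypervisor family, falling back to "unknown" on a miss."""
    entry = INSTANCE_TABLE.get(instance_type)
    return entry["hypervisor"] if entry else "unknown"


def supports_prefix_delegation(instance_type):
    """In this sketch, only instances recorded as Nitro qualify."""
    return hypervisor_family(instance_type) == "nitro"


def check_ipv6_startup(instance_type):
    """IPv6 requires prefix delegation, so a failed check aborts startup."""
    if not supports_prefix_delegation(instance_type):
        raise RuntimeError(
            f"Prefix Delegation is not supported on non-nitro instance {instance_type}"
        )
    return "ok"
```

In this model the instance really is Nitro-based, but the lookup misses, the family falls back to "unknown", and startup fails exactly as if the instance were not Nitro.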

Environment:

  • Kubernetes version (use kubectl version): 1.27
  • CNI Version: v1.12.6-eksbuild.2
  • OS (e.g. cat /etc/os-release): see AMI amazon-eks-node-1.27-v20230607.
  • Kernel (e.g. uname -a): see AMI amazon-eks-node-1.27-v20230607.

jdn5126 commented Jun 19, 2023

Closing as this is fixed in the v1.13.2 release.

jdn5126 closed this as completed on Jun 19, 2023