
[EKS]: Next Generation AWS VPC CNI Plugin #398

Closed
tabern opened this issue Jul 4, 2019 · 59 comments

Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

@tabern
Contributor

tabern commented Jul 4, 2019

Edit 8/3/2020: see the comment below for an update on the status of this feature. There will not be a single new plugin release, but rather a series of new features on the existing plugin.

We are working on the next version of the Kubernetes networking plugin for AWS. We've gotten a lot of feedback around the need for adding Kubenet and support for other CNI plugins in EKS. This update to the VPC CNI plugin is specifically built to solve the common limitations customers experience today with the AWS VPC CNI version 1 and other Kubernetes CNI plugins.

Notably:

  • Limited pod density per worker node
  • Need to re-deploy nodes to update CNI and route configurations
  • Static allocation of CIDR blocks for pods

Architecturally, the next generation VPC CNI plugin differs from the existing CNI plugin. The new plugin cleanly separates functionality that was tightly coupled in the existing CNI:

  1. Wiring up the network for pods, which runs on the Kubernetes worker nodes (data plane)
  2. Management of underlying EC2 networking infrastructure (control plane)

Pod networking (data plane) will continue to be part of the worker nodes, but the management of the networking infrastructure will be decoupled into a separate entity that will most likely run on the Kubernetes control plane. This will allow dynamic configuration of the CNI across a cluster, making it easy to support advanced networking configurations and change the networking configuration of the cluster on a per-service basis without restarting nodes.

These new functional behaviors are all supported while maintaining conformance to the standard Kubernetes network model requirements.

We think this CNI design will give customers the power and flexibility to run any workload size or density on a Kubernetes cluster using a single CNI plugin. We plan to implement this as the standard CNI for Amazon EKS, and release it as an open source project so that anyone running Kubernetes on AWS can utilize it.

The advantage of this approach is that it supports multiple networking modes and allows you to use them on the same cluster at the same time. We think these will be:

  1. Assign VPC secondary IP addresses to pods like the VPC CNI plugin does today.
  2. Allow pods to use IPs defined by a CIDR block assigned directly to a node. This is a separate CIDR range distinct from the node's network. This will give you the ability to have very large pod per node density – e.g. 256 or more pods on a [x].large EC2 instance without consuming your VPC IP space.
  3. Assign ENIs directly to pods. This mode takes advantage of EC2 ENI trunking and enables you to use all ENI features within the pod, such as assigning a security group directly to a pod.

You will be able to change which networking mode is used for pods on any given node and adjust the CIDR blocks used to assign IPs at any time. Additionally, the same VPC CNI plugin will work on both Linux and Windows nodes.

We're currently in the design and development stage of this plugin. We plan to release a beta of the CNI in the coming months. After this new CNI is generally available, we'll make it available in EKS. We do not plan to deprecate the current CNI plugin within EKS until we achieve parity between both generations of CNI plugins.

Let us know what you think below. We'll update this issue as we progress.

@tabern tabern added Proposed Community submitted issue EKS Amazon Elastic Kubernetes Service labels Jul 4, 2019
@tabern tabern changed the title [EKS]: The Next Generation VPC CNI Plugin [EKS]: Next Generation VPC CNI Plugin Jul 4, 2019
@tabern tabern changed the title [EKS]: Next Generation VPC CNI Plugin [EKS]: Next Generation AWS VPC CNI Plugin Jul 4, 2019
@gregoryfranklin

Does IPv6 support feature in the new design? I'd like to be able to run a dual stack network, assigning both an IPv4 and an IPv6 address to each pod. In this configuration, the behaviour of Kubernetes is that it will use the IPv4 address for things like service endpoints but it would allow pods to connect to external IPv6 sites.
I did try patching the existing aws-vpc-cni to support this and identified a number of issues such as:

  • IPv6 was disabled in the EKS AMI
  • DHCPv6 would attempt to assign all the IPv6 addresses allocated to the EC2 instance to the primary network interface (meaning they could not be moved to the pods while DHCPv6 was running).
  • The AWS VPC CNI could only assign one IP address to the pod.
  • The AWS VPC CNI could not request IPv6 addresses from the AWS API.

I have managed to get it working with some manual poking.

@sftim

sftim commented Jul 9, 2019

assigning a security group directly to a pod

Definitely looking forward to this feature; there's plenty of uses for it.

@tabern
Contributor Author

tabern commented Jul 9, 2019

@gregoryfranklin yes. While we are not currently planning to support IPv6 in the initial release, we believe this design is extensible and will allow us to support IPv6 in the future. We're interested in learning more about the need for dual stack; I think this is a separate networking mode that we will need to consider.

@gregoryfranklin

Interested in learning more about the need for dual stack

Dual stack is a migration path to IPv6-only.

We have several EKS clusters connected to a larger internal network via direct connects (hybrid cloud). IP address space is something we are having to start thinking about. It's not an immediate problem, but it will be in the next few years, which means we are having to think about migration paths now.

For ingress, traffic comes through an ELB which can take inbound IPv6 traffic and connect to an IPv4 backend pod. However, for egress the pods need to have an IPv6 address to connect to IPv6 services (in addition to an IPv4 address to connect to IPv4 services).

Dual stack pods would allow us to run parts of the internal network IPv6-only. For example, a webapp running in EKS could use an IPv6 database in our own datacentres.

Being able to expose our apps to IPv6 traffic is an important step in identifying and fixing IPv6 bugs in our code and infrastructure (of which we have many). Also it stops developers from introducing new IPv6 bugs.
Full IPv6 support is expected to take several years after enabling support at a network level. It's therefore important to us that we have IPv6 support at the network level so that we can work on the layers above.

@Vlaaaaaaad

+1 for IPv6 support due to IPv4 exhaustion. Especially when scaling EKS to a higher number of nodes, additional CIDRs have to be added to the VPC. IPv6 would be a perfect fix for this and would make higher-density EKS clusters easier.

@sftim

sftim commented Jul 11, 2019

Main reason I would want IPv6 is to run a cluster that is IPv6 only. Right now that's not something that Kubernetes itself supports very well; however, things seem to be catching up fast.

To handle connections from the public dual-stack internet, you could use Ingress, Proxy Protocol, etc (similar to how a typical cluster today maps from public IPv4 to private IPv4).

Possibly a SOCKS or HTTP proxy for outbound traffic too, which would allow access to IPv4-only APIs.

@reegnz

reegnz commented Jul 15, 2019

We are very much enthusiastic about this next-gen plugin that would benefit us greatly:

  • the 'higher pod density for small instances' part, as we are running nodejs microservices where a single smaller instance (e.g. t2.medium) is perfectly fine running 30-50 pods resource-wise, but the current CNI plugin imposes a pod limit that results in highly under-utilized nodes. That makes it hard to justify EKS compared to alternatives. We'd prefer a managed control plane on AWS, though.

  • the native 'security group per pod' part, as it would (hopefully) reduce user-facing complexity compared to kube2iam

So to summarize, this proposal is something we are greatly anticipating, and IMHO this sounds much more like a production-ready 1.0 CNI plugin from AWS compared to the previous one (that sadly doesn't really work for us microservice guys).

Keep up the good work!

@ewbankkit

@tabern Will the per-node pod CIDRs be implemented using kubenet or more like a full-blown overlay network?
Will this impose any limitations on the CNI used for NetworkPolicy, e.g. Calico or Cilium v1.6?

@jleadford

security goal that'd be useful:

no matter the mode (i.e. ENI trunking or secondary IP approach) or user configuration (e.g. lack of Network Policies through, let's say, Calico), the CNI should prevent Pods from accessing the host's metadata endpoint. this is a common issue seen in practice, which results in unintended credential exposure.

seems straightforward to solve with an iptables rule at the node when setting up a container's veth pair in https://github.com/aws/amazon-vpc-cni-k8s/blob/6886c6b362e89f17b0ce100f51adec4d05cdcd30/plugins/routed-eni/driver/driver.go (i.e. block traffic to 169.254.169.254 from that veth interface), for the general case. I am not familiar with ENI trunking, so I cannot suggest an approach there.

@jleadford

jleadford commented Aug 5, 2019

note that this rule construction is kube2iam's general approach https://github.com/jtblin/kube2iam#iptables, though it doesn't drop the traffic from the Pod outright, due to its feature set. they use a neat 'glob' I wasn't aware of, so you wouldn't need to create a rule per-veth at creation time (i.e. eni+ to match all after that prefix).
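
For illustration, a minimal sketch of such a rule, assuming the host-side veth interfaces created by the VPC CNI all share the eni prefix (as the kube2iam docs imply); this is a sketch, not an official recommendation:

# Drop pod traffic to the EC2 instance metadata service.
# "eni+" is an iptables interface wildcard, so a single rule covers every
# host-side veth the VPC CNI creates instead of one rule per veth.
iptables --insert FORWARD 1 --in-interface eni+ \
  --destination 169.254.169.254/32 --jump DROP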

@mbelang

mbelang commented Sep 3, 2019

any update on this @tabern ?

@jwenz723

Is there a repository where the progress of this next-gen cni code can be viewed and tracked?

@shelby-moore

Any thoughts around adding support for enforcing network policies to the cni plugin? It would be great if security groups could be used in the ingress/egress rules for network policies.

@alfredkrohmer

alfredkrohmer commented Oct 1, 2019

We have certain use-cases where we need to expose the pods directly to the public internet, so they need a public IP (WebRTC, STUN/TURN). It would be an awesome feature if the new CNI plugin would be able to assign a public IP or EIP to pods (e.g. when a certain annotation on the pod is given) and also put the assigned IP into some status field or annotation of the pod.

Currently we are working around this by using autoscaling groups with node taints (dedicated=xyz:NoSchedule) that assign public IPs to the instances. The instances in these autoscaling groups don't have the CNI plugin enabled and are only used by pods with host networking enabled.

@MarcusNoble

We believe that this feature roadmap will address the majority of networking challenges present today, however, we also realize that a single CNI plugin is unlikely to meet every possible use case, and to that end we have been working closely with our partners that maintain alternate compatible CNI plugins. These partners have developed EKS specific landing pages along with details on how to obtain commercial support, which we have highlighted in our documentation.

Is it ever going to be possible to use one of these partner CNIs with AdmissionWebhooks? E.g. routable from the API server to the overlay network?

@steven-tan

I appreciate the recent update from @mikestef9, but I still have no sense of what this means in terms of timing. Our org has desperately wanted to switch to EKS for various reasons, but node density and CNI custom networking improvements are must-haves for us. I'm not expecting exact dates, but it feels like these improvements have been in the "coming months" stage for over a year. If these improvements aren't rolled out by say, EOY - it's quite probable we just have to skip our EKS plans altogether.

@eightnoteight

eightnoteight commented Aug 4, 2020

We believe that this feature roadmap will address the majority of networking challenges present today, however, we also realize that a single CNI plugin is unlikely to meet every possible use case, and to that end we have been working closely with our partners that maintain alternate compatible CNI plugins. These partners have developed EKS specific landing pages along with details on how to obtain commercial support, which we have highlighted in our documentation.

Is it ever going to be possible to use one of these partner CNIs with AdmissionWebhooks? E.g. routable from the API server to the overlay network?

@MarcusNoble
I think one workaround you can do is set up a managed ingress like AWS ALB and make the admission webhooks come in via the ALB. We are using this method even with the AWS CNI to get more visibility around the callbacks.

@MarcusNoble

@eightnoteight That only really helps where you're managing the webhooks yourself. We've got a few third-party applications used in our clusters that set up webhooks for themselves, so we'd need to manually modify the manifests of those applications or, in cases where they're created in code at runtime, fork and update the application. 😞

@vijaymateti

@mikestef9 Will an overlay network be an option for pod networking, or will it be ENI based?

@mikestef9
Contributor

No, we will not be building any overlay option into the VPC CNI plugin, as that strays quite a bit from the original design goals of the plugin and will add too much complexity. Custom networking is our "overlay-like" option in the VPC CNI, but as I mentioned above, "we also realize that a single CNI plugin is unlikely to meet every possible use case", and added links to our docs that do list alternate CNI plugins with overlay options.

We feel the best solution to IPv4 exhaustion is IPv6, and that's where we are investing with the VPC CNI plugin.

@jontro

jontro commented Aug 20, 2020

How do Increased Pod Density and Security Groups Per Pod interoperate? Will they be compatible with each other? I saw a comment mentioning a limit of 50 ENIs per node when it comes to VLAN tagging.

@jwenz723

@mikestef9 I'm glad you are acknowledging that the VPC CNI cannot meet every possible use case, and I'm grateful for the documentation that has been added on how to install alternate CNIs on EKS. However, all of these alternate CNIs have the limitation that they cannot be installed on to the control plane master nodes, which I am sure you are aware of. This means that things like admission controller webhooks will fail, as well as other things that require a control plane node to communicate with a pod on a worker node. Are there any plans in place to fix this problem to allow 3rd party CNIs to be fully functional?

@duttab49

duttab49 commented Sep 5, 2020

Hi @mikestef9, is there any documentation available for configuring pods using the security-group-related custom resource definition?
Since this feature is available in the latest VPC CNI 1.7.1, I would like to understand more about configuring SGs per pod in EKS.
I want to try this feature of the VPC CNI.
Please point me to any available documentation.

@mikestef9
Contributor

Documentation is published

https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html

Stay tuned for further updates on #177
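
For anyone looking for a quick starting point before reading those docs, a minimal sketch of the SecurityGroupPolicy resource that feature introduces might look like the following (the namespace, labels, and security group ID are placeholders):

# Attach a security group to pods matching the label selector
# (all names and the group ID below are placeholders).
cat <<'EOF' | kubectl apply -f -
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: my-sg-policy
  namespace: my-namespace
spec:
  podSelector:
    matchLabels:
      app: my-app
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0
EOF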

@duttab49

Thanks, @mikestef9 for sharing

@snese

snese commented Sep 10, 2020

Documentation is published

https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html

Stay tuned for further updates on #177

Nice!

Does @mikestef9 have any timeline for security-groups-for-pods on Fargate? It would be useful for planning the migration.

@mikestef9
Contributor

You can follow this issue #625 for updates on that feature request. No timeline to share right now. Note that the UX we have in mind there will be the same as the SecurityGroupPolicy CRD for worker nodes, and not something that is added to the Fargate Profile

@jwenz723

jwenz723 commented Feb 9, 2021

I recently evaluated Managed Node Groups with the mistaken assumption that deleting the aws-node daemonset and installing a CNI with alternate IPAM like Calico would remove the max pod limit. I didn't realize that a bootstrap argument was required, and that Managed Node Groups do not support bootstrap arguments. So, this essentially means that anyone who requires more than the max pod limit per node is effectively limited to only unmanaged worker nodes at this point.

I thought I would post here for the benefit of anyone else considering Managed Node Groups, since many people will ultimately find the usefulness limited until this next-gen CNI is available.

There are indeed hacks like creating a daemonset that runs a script to update the node configuration (with a chroot to the host), or manually SSHing into the nodes. I created a support ticket for official advice about how to work around this issue and was informed that any such modifications to update the max pod limit are considered out of band, may introduce inconsistencies, and are not recommended or supported.

EDIT: Updated for clarity

@bencompton Sorry for resurrecting an old comment, but I wanted to post a solution to this problem here since no one else has yet.

Setting the max number of pods per node is a native kubelet functionality, see --max-pods. The AWS documentation suggests setting this value by passing something like --use-max-pods false --kubelet-extra-args '--max-pods=20' to the bootstrap.sh script. The bootstrap.sh script takes the received value and sets it into the kubelet config file using jq here.
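
On self-managed nodes that looks something like the following sketch (the cluster name is a placeholder):

# Self-managed node user data: cap pods at 20 instead of letting
# bootstrap.sh derive the limit from the instance's ENI capacity.
/etc/eks/bootstrap.sh my-cluster \
  --use-max-pods false \
  --kubelet-extra-args '--max-pods=20'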

It is not possible to pass the documented arguments to the bootstrap.sh script with managed worker nodes; however, it is possible to add custom userdata to a launch template that is used by your managed worker nodes. There are some requirements for the formatting of the userdata that are not typical, so make sure to familiarize yourself with the specifics here.

So to set a custom maxPods value you need to do 2 things:

  1. set USE_MAX_PODS to false when bootstrap.sh executes to prevent a maxPods value from being set in the kubelet config file
  2. set a custom maxPods value into the kubelet config file as done here

Here is my userdata which implements these 2 tasks:

#!/bin/bash
set -ex

BOOTSTRAP_SH=/etc/eks/bootstrap.sh
BOOTSTRAP_USE_MAX_PODS_SEARCH="USE_MAX_PODS:-true"
KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
MAX_PODS=20 # put whatever quantity you want here

# set a maxPods value in the KUBELET_CONFIG file
echo "$(jq ".maxPods=$MAX_PODS" $KUBELET_CONFIG)" > $KUBELET_CONFIG

# search for the string to be replaced by sed and return a non-zero exit code if not found. This is used for safety in case the bootstrap.sh
# script gets changed in a way that is no longer compatible with our USE_MAX_PODS replacement command.
grep -q $BOOTSTRAP_USE_MAX_PODS_SEARCH $BOOTSTRAP_SH

# set the default for USE_MAX_PODS to false so that the maxPods value set in KUBELET_CONFIG will be honored
sed -i"" "s/$BOOTSTRAP_USE_MAX_PODS_SEARCH/USE_MAX_PODS:-false/" $BOOTSTRAP_SH

This is a workaround that works for now. This is certainly not recommended by AWS, and could break at some point in time depending on updates made to the bootstrap.sh script. So use this method with caution. Eventually this should no longer be needed based upon this comment from @mikestef9 above and #867:

  • Max pods must be manually calculated and passed to kubelet of worker nodes.
    • We’ll automate this process so users don’t need to manually calculate a value that is dependent on networking mode. This will unlock CNI custom networking with Managed Node Groups.

@youwalther65

This is my current workaround for CNI custom networking with an MNG (managed node group). It is dynamic, but requires access to IMDS and the EC2 API, plus internet access to install bc for the calculation (this could be done with built-in Python as well, for sure ;-) ):

Custom launch template user data

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
#Amazon Linux 2 based script
#to determine instance type from instance metadata and calculate max pods for CNI custom networking
#and set this inside EKS bootstrap script

#install bc, requires internet access
yum -y install bc

#gather instance type from metadata
INST_TYPE=$(curl -s http://169.254.169.254/latest/meta-data/instance-type)

#gather region from metadata, jq is pre-installed
export AWS_DEFAULT_REGION=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)

#gather ENI info, aws CLI is pre-installed, requires internet access
ENI_INFO=$(aws ec2 describe-instance-types --filters Name=instance-type,Values=$INST_TYPE --query "InstanceTypes[].[InstanceType, NetworkInfo.MaximumNetworkInterfaces, NetworkInfo.Ipv4AddressesPerInterface]" --output text)

#calculate max-pods
MAX_ENI=$(echo $ENI_INFO | awk '{print $2}')
MAX_IP=$(echo $ENI_INFO | awk '{print $3}')
MAX_PODS=$(echo "($MAX_ENI-1)*($MAX_IP-1)+2" | bc)

sed -i 's/^USE_MAX_PODS=.*/USE_MAX_PODS="false"/' /etc/eks/bootstrap.sh
sed -i '/^KUBELET_EXTRA_ARGS=/a KUBELET_EXTRA_ARGS+=" --max-pods='$MAX_PODS'"' /etc/eks/bootstrap.sh

--==MYBOUNDARY==--

@gauravmittal80

Hi,

Any update on when we can expect IPv6 support for EKS? Also, is there any workaround to get dual stack support for pods in EKS?

Regards,
Gaurav

@davidroth

@mikestef9 Unfortunately this is no longer on the roadmap?

@billinghamj

billinghamj commented Jul 16, 2021

@davidroth I think it was kind of replaced/broken up into smaller features - like IPv6 support, higher IP density for pods on nodes, etc. So there's no longer going to be an explicit switch to a brand new plugin, more continuous improvements to the existing one :)

Edit: it was touched upon here: #398 (comment)

@FlorianOtel
Copy link

As the feature is now GA -- see https://aws.amazon.com/jp/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/ for details -- I suggest closing this issue.

@mikestef9
Contributor

mikestef9 commented Sep 7, 2021

@FlorianOtel the last outstanding pain point discussed originally in this issue is IPv4 exhaustion. I plan on closing once we launch IPv6 support, #835.

@ChrisMcKee

ChrisMcKee commented Dec 8, 2021

With regards to the VPC CNI plugin, and especially around Windows support, it would be greatly helpful if the documentation and troubleshooting guides were updated to cover how you're supposed to debug the new wiring, rather than, as at present, covering how to debug the older webhooks/controller version. (If there's a separate repo for documentation, please let me know.)

Part of this would appear to be working on and completing several aged PRs in the CNI repo, which would help address the way the CNI setup fails silently / without feedback.

@mikestef9
Contributor

mikestef9 commented Jan 7, 2022

Closing as we have now released native VPC CNI features to address all of the initial pain points discussed in this issue.
