feat(ec2): support multi NIC/IP setups #4799
For EC2 instances with multiple NICs, policy-based routing will be configured on secondary NICs / secondary IPs to ensure outgoing packets are routed via the correct interface. Without this extra routing config, traffic coming in via secondary NICs was routed using the main routing table, which can contain only one default route, and the kernel takes only the destination IP address into account when selecting a route. Packets for destinations beyond the local networks were therefore always routed through the default route, the one associated with the primary NIC. If traffic from specific source IP addresses is associated with another NIC, then without these routing policies it would flow over the default route and the connection couldn't be established. References: [1] https://bootstack.canonical.com/cases/00336928 [2] https://bootstack.canonical.com/cases/00377150
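As a rough illustration of the policy-based routing described in this commit message, the sketch below builds a netplan-style (network config v2) stanza for one secondary NIC. The helper name `secondary_nic_policy_routing`, the interface name, the addresses, the gateway and the table number are all hypothetical; only the `routes`, `routing-policy`, `to`, `via`, `from` and `table` keys come from netplan's schema. It is not claimed to be the code this PR adds.

```python
def secondary_nic_policy_routing(nic_name, source_ips, gateway, table):
    """Build a netplan-style stanza that routes traffic *from* source_ips
    through a dedicated routing table whose default route uses this NIC.

    Hypothetical helper for illustration only; not cloud-init's code.
    """
    return {
        nic_name: {
            # A default route kept in its own table, so it does not compete
            # with the primary NIC's default route in the main table.
            "routes": [{"to": "0.0.0.0/0", "via": gateway, "table": table}],
            # One rule per source IP: packets carrying this source address
            # are looked up in the dedicated table instead of the main one.
            "routing-policy": [
                {"from": ip, "table": table} for ip in source_ips
            ],
        }
    }


# Example values are made up.
stanza = secondary_nic_policy_routing(
    "ens6", ["10.0.1.20", "10.0.1.21"], gateway="10.0.1.1", table=101
)
```

Giving each secondary NIC its own table is one way to sidestep the single-default-route limitation of the main table mentioned above.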
…l#4799) Add extra logic to only trigger hook-hotplug on NICs with known drivers on EC2. This avoids the hook being triggered on every add/remove event on net devices, which causes unneeded CPU usage, for example on instances that start and stop a lot of Docker containers; see [1]. Rename 10-cloud-init-hook-hotplug.rules to 99-cloud-init-hook-hotplug.rules, as ID_NET_DRIVER was not yet defined at matching time. References: [1] https://bugs.launchpad.net/cloud-init/+bug/1946003
I would be interested in @nmeyerhans's feedback on this, if you have some time. This is intended to support multi-NIC use cases for Ubuntu-based EC2 instances (i.e. a Canonical-supported alternative to ec2-net-utils). I don't know that this PR is intended to cover every use case yet.
Overall, looks good! I left some comments inline.
More broadly, do we need to use separate routing tables to accomplish this? Netplan/V2 allow specifying routes per interface. Can specifying `from` on a route rather than specifying a table work for this?
If not, this code is netplan-specific (not v2-specific), and we'll need to gate it as such.
The issue is that standard IP routing cannot take the source address of an outgoing packet into consideration. Source-based routing (aka policy routing) is necessary because VPCs (in the default configuration, in which src/dest checking is enabled) enforce that packets egressing an ENI must have a source address that matches the list of addresses or prefixes assigned to that ENI in the VPC control plane. Packets that do not meet these criteria are dropped. Policy routing, which allows routing decisions based on additional factors such as the source address, is necessary in order to comply with the VPC's restrictions.

A common scenario that triggers this relates to return traffic in response to an inbound packet to an instance with multiple attached ENIs. That traffic will come in with a destination IP matching one of the instance's interfaces. When the response is sent, that address will become the source address of the reply packet, and routing will be performed according to the entries in the main routing table. The routing decision will take the destination of the outgoing packet into account, but not the source. Thus it won't necessarily result in this response packet egressing the same ENI on which the original packet was received. If it egresses a different ENI, it will be dropped.

The policy rules let us ensure that the traffic is routed according to a table that only knows about the interfaces with a given source address associated with them in the VPC control plane. There isn't a way to ensure routing compliance with these VPC restrictions with standard destination-based routing behavior.
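The return-traffic scenario described in this comment can be sketched as a toy model. Everything here is made up for illustration (the tables, addresses, interface names and the table number 101), and the lookup is a simplification: it does a longest-prefix match in one table and does not model the kernel's rule priorities or its fall-through to the main table.

```python
import ipaddress

# Toy model: each table maps destination prefixes to an egress interface.
MAIN_TABLE = {
    "10.0.0.0/24": "eth0",
    "10.0.1.0/24": "eth1",
    "0.0.0.0/0": "eth0",  # the single default route, via the primary NIC
}
TABLE_101 = {"0.0.0.0/0": "eth1"}  # table dedicated to the secondary ENI

# Policy rules: source address -> table consulted instead of the main table.
RULES = {"10.0.1.20": TABLE_101}


def lookup(table, dst):
    """Longest-prefix match of dst in a single routing table."""
    dst_ip = ipaddress.ip_address(dst)
    matches = [
        ipaddress.ip_network(prefix)
        for prefix in table
        if dst_ip in ipaddress.ip_network(prefix)
    ]
    best = max(matches, key=lambda net: net.prefixlen)
    return table[str(best)]


def egress_interface(src, dst, use_policy_rules):
    """Pick the egress interface for a packet from src to dst."""
    if use_policy_rules and src in RULES:
        return lookup(RULES[src], dst)
    # Destination-only routing: the source address is ignored entirely.
    return lookup(MAIN_TABLE, dst)


# A reply sourced from eth1's address to a remote client:
without_rules = egress_interface("10.0.1.20", "203.0.113.7", False)
with_rules = egress_interface("10.0.1.20", "203.0.113.7", True)
```

In this model the reply leaves via `eth0` when only the destination is considered, which is the case the VPC would drop, and via `eth1` once the source-based rule is applied.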
Some instances, such as p5 instances, can have multiple network cards, with device numbers repeated across them; see [0,1]. Add support to properly render the network configuration associated with them. References: [0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#network-cards [1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/p5-efa.html
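One way to think about repeated device numbers is that a NIC is only identified unambiguously by the pair (network card, device number). The sketch below orders NIC metadata by that pair. The helper `order_nics` and the sample data are hypothetical, and the `network-card` and `device-number` field names are assumed to mirror the EC2 instance metadata keys; this is not necessarily how the PR implements it.

```python
def order_nics(nics):
    """Sort NIC metadata dicts by (network-card, device-number).

    Hypothetical helper: a missing "network-card" is treated as card 0.
    """
    return sorted(
        nics,
        key=lambda nic: (
            int(nic.get("network-card", 0)),
            int(nic["device-number"]),
        ),
    )


# Made-up metadata: device-number 1 appears on both network cards.
nics = [
    {"mac": "06:00:00:00:00:03", "network-card": "1", "device-number": "1"},
    {"mac": "06:00:00:00:00:01", "network-card": "0", "device-number": "0"},
    {"mac": "06:00:00:00:00:02", "network-card": "0", "device-number": "1"},
]
ordered_macs = [nic["mac"] for nic in order_nics(nics)]
```

Sorting on the device number alone would leave the two NICs with device number 1 in an arbitrary relative order, which is why the card index is included in the key.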
Now that LP: #1946003 cannot happen on EC2, because we match on NIC drivers, enable network updates on hotplug events by default on the platform.
Add extra logic to only trigger hook-hotplug on NICs with known drivers on EC2. This avoids the hook being triggered on every add/remove event on net devices, which causes unneeded CPU usage, for example on instances that start and stop a lot of Docker containers; see [1]. Rename 10-cloud-init-hook-hotplug.rules to 90-cloud-init-hook-hotplug.rules, as ID_NET_DRIVER is not defined until 80-net-setup-link.rules [2] is sourced. References: [1] https://bugs.launchpad.net/cloud-init/+bug/1946003 [2] https://github.com/systemd/systemd/blob/main/rules.d/80-net-setup-link.rules
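The reason for the rename comes down to udev processing rules files in lexical order of their file names. A minimal sketch of that ordering, using only the file names mentioned in the commit message (the helper `runs_after_setter` is hypothetical):

```python
# udev reads rules files in lexical order of their names, so a file can only
# match on a property that an earlier-sorted file has already set.
rules_files = [
    "90-cloud-init-hook-hotplug.rules",
    "80-net-setup-link.rules",
    "10-cloud-init-hook-hotplug.rules",
]
ordered = sorted(rules_files)

# Per the commit message, ID_NET_DRIVER is defined by this file.
SETTER = "80-net-setup-link.rules"


def runs_after_setter(name):
    """True if `name` is processed after the file that sets ID_NET_DRIVER."""
    return ordered.index(name) > ordered.index(SETTER)


old_name_ok = runs_after_setter("10-cloud-init-hook-hotplug.rules")
new_name_ok = runs_after_setter("90-cloud-init-hook-hotplug.rules")
```

With the `10-` prefix the hotplug rule is evaluated before `ID_NET_DRIVER` exists, so a match on the driver cannot succeed; with the `90-` prefix it is evaluated afterwards.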
Avoids LP: #1946003 on upgraded systems. References: [0] canonical#4799 [1] canonical@b519d86
Avoids LP: #1946003 on upgraded systems. References: [0] canonical#4799 [1] canonical@b519d86 Co-authored-by: Chad Smith <chad.smith@canonical.com>
This PR essentially broke the routing logic in our production code. We will have to clean up these redundant PBR rules to get things working again. IMHO, how traffic is routed (either locally originated or pass-through traffic) in a multi-homed instance should be up to the end user, rather than relying on these automatically generated rules. My comment is based on the assumption that Linux instances running on AWS EC2 are general-purpose, and we can't really assume that most users would like this new behavior. There are many ways to implement a multi-WAN-like routing scheme, and most of them take advantage of PBR rules, which means these extra rules will likely break those systems.
@xdxu sorry about the breaking changes. We thought that providing a default working environment would add more value, and that advanced users / use cases could disable cloud-init's network configuration or provide a custom network configuration via cloud-init or other means. If you still disagree, please open a new issue to discuss it with more visibility, or consider hanging out in the #cloud-init IRC channel on Libera. Thanks.
cloud-init (24.1.3-0ubuntu1~20.04.1) focal; urgency=medium

  * Upstream snapshot based on 24.1.3. (LP: #2056100).
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/24.1.3/ChangeLog

cloud-init (24.1.2-0ubuntu1~20.04.1) focal; urgency=medium

  * refresh patches:
    - d/p/retain-ec2-default-net-update-events.patch
  * Upstream snapshot based on 24.1.2. (LP: #2056100).
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/24.1.2/ChangeLog

cloud-init (24.1.1-0ubuntu1~20.04.1) focal; urgency=medium

  * d/apport-general-hook.py: Move apport hook to main branch
  * d/cloud-init.maintscript: remove /etc/cloud/clean.d/README
  * d/cloud-init.logrotate: add logrotate config for cloud-init
  * d/cloud-init.templates: enable WSL datasource by default
  * d/p/keep-dhclient-as-priority-client.patch:
    - Upstream switched to dhcpcd, keep isc-dhclient as the client
  * d/p/revert-551f560d-cloud-config-after-snap-seeding.patch
    - Retain systemd ordering cloud-config.service After=snapd.seeded.service
  * d/p/retain-ec2-default-net-update-events.patch:
    Reverts 4dbb08f5f0cc4f41cf9dd1474f0600a11510a3c9 to not change behavior
    on stable releases.
  * d/po/templates.pot: update for wsl
  * d/cloud-init.postinst: change priority of hotplug rules.
    Avoids LP #1946003 on upgraded systems.
    References:
    [0] canonical/cloud-init#4799
    [1] commit/b519d861aff8b44a0610c176cb34adcbe28df144
  * refresh patches:
    - d/p/netplan99-cannot-use-default.patch
    - d/p/retain-netplan-world-readable.patch
    - d/p/status-do-not-remove-duplicated-data.patch
    - d/p/status-retain-recoverable-error-exit-code.patch
  * Upstream snapshot based on 24.1.1. (LP: #2056100).
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/24.1.1/ChangeLog
cloud-init (24.1.3-0ubuntu1~22.04.1) jammy; urgency=medium

  * Upstream snapshot based on 24.1.3. (LP: #2056100).
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/24.1.3/ChangeLog

cloud-init (24.1.2-0ubuntu1~22.04.1) jammy; urgency=medium

  * refresh patches:
    - d/p/retain-ec2-default-net-update-events.patch
  * Upstream snapshot based on 24.1.2. (LP: #2056100).
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/24.1.2/ChangeLog

cloud-init (24.1.1-0ubuntu1~22.04.1) jammy; urgency=medium

  * d/apport-general-hook.py: Move apport hook to main branch
  * d/cloud-init.maintscript: remove /etc/cloud/clean.d/README
  * d/cloud-init.logrotate: add logrotate config for cloud-init
  * d/cloud-init.templates: enable WSL datasource by default
  * Drop d/p/retain-netplan-world-readable.patch:
    - Limit perms to 600 of /etc/netplan/50-cloud-init.yaml instead of 644
      (LP: #2053157)
  * d/p/keep-dhclient-as-priority-client.patch:
    - keep dhclient as default client
  * d/p/revert-551f560d-cloud-config-after-snap-seeding.patch
    - Retain systemd ordering cloud-config.service After=snapd.seeded.service
  * d/p/retain-ec2-default-net-update-events.patch:
    Reverts 4dbb08f5f0cc4f41cf9dd1474f0600a11510a3c9 to not change behavior
    on stable releases.
  * d/po/templates.pot: update for wsl
  * d/cloud-init.postinst: change priority of hotplug rules.
    Avoids LP #1946003 on upgraded systems.
    References:
    [0] canonical/cloud-init#4799
    [1] commit/b519d861aff8b44a0610c176cb34adcbe28df144
  * refresh patches:
    - d/p/status-do-not-remove-duplicated-data.patch
    - d/p/status-retain-recoverable-error-exit-code.patch
  * Upstream snapshot based on 24.1.1. (LP: #2056100).
    List of changes from upstream can be found at
    https://raw.githubusercontent.com/canonical/cloud-init/24.1.1/ChangeLog
Read individual commits for more context.
Additional Context
https://warthogs.atlassian.net/browse/SC-1662
https://bootstack.canonical.com/cases/00336928
https://bootstack.canonical.com/cases/00377150
Test Steps
Run the added integration tests on EC2.