ec2, hotplug: race between udev add event and IMDS data not being available #5373

Closed
aciba90 opened this issue Jun 5, 2024 · 8 comments
Labels
bug Something isn't working correctly ds: ec2 Issues specific to DataSourceEc2

Comments

@aciba90
Contributor

aciba90 commented Jun 5, 2024

Bug report

Cloud-init hits what @nmeyerhans mentioned in #4799 (comment):

One thing to be aware of when accessing IMDS in response to udev add events is that IMDS propagates asynchronously, and the data corresponding with a newly attached network interface may not be complete at the time that the handler is running.

Sometimes the IMDS does not yet have the full metadata for a hotplugged NIC at the moment the hotplug udev event triggers cloud-init.

Comparing the instance network metadata captured after the hotplug with the same metadata after a reboot shows what was missing:

--- tmp/pre/cloud-init-logs-2024-06-05/run/cloud-init/instance-data.json        2024-06-05 11:46:07.000000000 +0200
+++ tmp/post/cloud-init-logs-2024-06-05/run/cloud-init/instance-data.json       2024-06-05 12:49:30.000000000 +0200
@@ -91,7 +91,7 @@
      "info": {
       "AccountId": "937157663530",
       "Code": "Success",
-      "LastUpdated": "2024-06-05T09:44:21Z"
+      "LastUpdated": "2024-06-05T10:33:52Z"
      }
     }
    },
@@ -141,6 +141,7 @@
       "06:dc:23:8e:6c:bc": {
        "device-number": "1",
        "interface-id": "eni-08f52ea023734d621",
+       "ipv6s": "2a05:d011:311:a00:c4f1:eb1b:92d2:200e",
        "local-hostname": "ip-192-168-5-5.eu-south-2.compute.internal",
        "local-ipv4s": "192.168.5.5",
        "mac": "06:dc:23:8e:6c:bc",
@@ -155,8 +156,11 @@
        ],
        "subnet-id": "subnet-0877465f33a6d8dde",
        "subnet-ipv4-cidr-block": "192.168.0.0/20",
+       "subnet-ipv6-cidr-blocks": "2a05:d011:311:a00:0:0:0:0/64",
        "vpc-id": "vpc-031a740d79767510f",
-       "vpc-ipv4-cidr-block": "192.168.0.0/20"
+       "vpc-ipv4-cidr-block": "192.168.0.0/20",
+       "vpc-ipv4-cidr-blocks": "192.168.0.0/20",
+       "vpc-ipv6-cidr-blocks": "2a05:d011:311:a00:0:0:0:0/56"
       }
      }
     }
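
As a rough illustration of what the diff shows, the sketch below lists the keys that IMDS has not yet published for a NIC. It assumes the per-MAC metadata has already been parsed into a dict keyed by MAC address, as in instance-data.json. `missing_nic_keys` and `EXPECTED_IPV6_KEYS` are hypothetical names, not cloud-init code, and treating these keys as required only makes sense for a NIC that was attached with an IPv6 address.

```python
# Keys that only appeared after the reboot in the diff above. Treating them
# as "expected" is an assumption for illustration; an IPv4-only NIC would
# never publish them.
EXPECTED_IPV6_KEYS = (
    "ipv6s",
    "subnet-ipv6-cidr-blocks",
    "vpc-ipv6-cidr-blocks",
)


def missing_nic_keys(macs_metadata, mac, expected=EXPECTED_IPV6_KEYS):
    """Return the expected keys that are not yet present for `mac`."""
    nic = macs_metadata.get(mac)
    if nic is None:
        # The NIC itself is not visible in the metadata yet.
        return list(expected)
    return [key for key in expected if key not in nic]


MAC = "06:dc:23:8e:6c:bc"

# Abbreviated "pre" and "post" states, using values from the diff above.
pre = {MAC: {"device-number": "1", "local-ipv4s": "192.168.5.5"}}
post = {
    MAC: {
        "device-number": "1",
        "local-ipv4s": "192.168.5.5",
        "ipv6s": "2a05:d011:311:a00:c4f1:eb1b:92d2:200e",
        "subnet-ipv6-cidr-blocks": "2a05:d011:311:a00:0:0:0:0/64",
        "vpc-ipv6-cidr-blocks": "2a05:d011:311:a00:0:0:0:0/56",
    }
}

print(missing_nic_keys(pre, MAC))
print(missing_nic_keys(post, MAC))
```

A check like this could serve as the completeness condition for a wait / retry loop, with the caveat that the caller has to know whether IPv6 is expected on the interface at all.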

Steps to reproduce the problem

export CLOUD_INIT_CLOUD_INIT_SOURCE=cloud-init_all.deb
export CLOUD_INIT_PLATFORM=ec2
export CLOUD_INIT_OS_IMAGE=noble

while true; do
	tox -e integration-tests -- --pdb \
		tests/integration_tests/modules/test_hotplug.py::test_multi_nic_hotplug_vpc
done

And observe:


    @pytest.mark.skipif(CURRENT_RELEASE <= FOCAL, reason="See LP: #2055397")
    @pytest.mark.skipif(PLATFORM != "ec2", reason="test is ec2 specific")
    def test_multi_nic_hotplug_vpc(setup_image, session_cloud: IntegrationCloud):
        """Tests that additional secondary NICs are routable from local
        networks after the hotplug hook is executed when network updates
        are configured on the HOTPLUG event."""
        with session_cloud.launch(
            user_data=USER_DATA
        ) as client, session_cloud.launch() as bastion:
            ips_before = _get_ip_addr(client)
            primary_priv_ip4 = ips_before[1].ip4
            primary_priv_ip6 = ips_before[1].ip6
            client.instance.add_network_interface(ipv6_address_count=1)
    
            _wait_till_hotplug_complete(client)
            log_content = client.read_from_file("/var/log/cloud-init.log")
            verify_clean_log(log_content)
    
            netplan_cfg = client.read_from_file("/etc/netplan/50-cloud-init.yaml")
            config = yaml.safe_load(netplan_cfg)
    
            ips_after_add = _get_ip_addr(client)
            secondary_priv_ip4 = ips_after_add[2].ip4
            secondary_priv_ip6 = ips_after_add[2].ip6
            assert primary_priv_ip4 != secondary_priv_ip4
    
            new_addition = [
                ip for ip in ips_after_add if ip.ip4 == secondary_priv_ip4
            ][0]
            assert new_addition.interface in config["network"]["ethernets"]
            new_nic_cfg = config["network"]["ethernets"][new_addition.interface]
            assert "routing-policy" in new_nic_cfg
>           assert [
                {"from": secondary_priv_ip4, "table": 101},
                {"from": secondary_priv_ip6, "table": 101},
            ] == new_nic_cfg["routing-policy"]
E           AssertionError: assert [{'from': '192.168.3.64', 'table': 101}, {'from': 'fe80::ca:54ff:fee0:3c8f', 'table': 101}] == [{'from': '192.168.3.64', 'table': 101}]
E             
E             Left contains one more item: {'from': 'fe80::ca:54ff:fee0:3c8f', 'table': 101}
E             
E             Full diff:
E               [
E                   {
E                       'from': '192.168.3.64',
E                       'table': 101,
E                   },
E             +     {
E             +         'from': 'fe80::ca:54ff:fee0:3c8f',
E             +         'table': 101,
E             +     },
E               ]
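
One possible reading of this failure, which the thread does not confirm: the netplan config holds only the IPv4 rule because the IMDS had not yet published `ipv6s` for the new NIC, so the only IPv6 address the test could read from the interface was its link-local one. The sketch below uses only the standard `ipaddress` module to show that the address in the assertion is link-local. `routable_ipv6` is a hypothetical helper, not part of the test suite.

```python
import ipaddress


def routable_ipv6(addresses):
    """Return the addresses that are IPv6 and not link-local (fe80::/10)."""
    result = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        if ip.version == 6 and not ip.is_link_local:
            result.append(addr)
    return result


# Addresses taken from the assertion output above.
observed = ["192.168.3.64", "fe80::ca:54ff:fee0:3c8f"]
print(routable_ipv6(observed))

# Address taken from the post-reboot metadata diff earlier in this report.
print(routable_ipv6(["2a05:d011:311:a00:c4f1:eb1b:92d2:200e"]))
```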

Environment details

  • Cloud-init version: 24.1 and tip of main.
  • Operating System Distribution: Ubuntu
  • Cloud provider, platform or installer type: Ec2

cloud-init logs

cloud-init.tar.gz
cloud-init-post-reboot.tar.gz

@aciba90 aciba90 added bug Something isn't working correctly new An issue that still needs triage labels Jun 5, 2024
@aciba90
Contributor Author

aciba90 commented Jun 5, 2024

Some kind of retry / waiting has to be added when required networking properties from the IMDS are missing.
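
A minimal sketch of such a wait / retry loop, not the cloud-init implementation. `wait_for_metadata`, its parameters, and the default timeout and interval are all hypothetical. The caller supplies `fetch` (whatever reads the IMDS) and `is_complete` (whatever decides that the required networking properties are present).

```python
import time


def wait_for_metadata(fetch, is_complete, timeout=30.0, interval=1.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll `fetch()` until `is_complete(data)` holds or `timeout` elapses.

    Returns the last fetched value. The caller decides what to do when the
    data is still incomplete after the timeout.
    """
    deadline = clock() + timeout
    while True:
        data = fetch()
        if is_complete(data) or clock() >= deadline:
            return data
        sleep(interval)


# Simulated IMDS that publishes "ipv6s" only on the third read.
responses = iter([
    {"local-ipv4s": "192.168.5.5"},
    {"local-ipv4s": "192.168.5.5"},
    {"local-ipv4s": "192.168.5.5",
     "ipv6s": "2a05:d011:311:a00:c4f1:eb1b:92d2:200e"},
])

result = wait_for_metadata(
    fetch=lambda: next(responses),
    is_complete=lambda nic: "ipv6s" in nic,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(result)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays. A bounded timeout matters here because an IPv4-only NIC will never publish the IPv6 keys.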

@TheRealFalcon
Member

@aciba90, so this has always been an issue with hotplug, and we just happen to be seeing it now?

@aciba90
Contributor Author

aciba90 commented Jun 5, 2024

@TheRealFalcon: I think so. The absence of #5271 was masking this race condition, which does not occur every time.

I believe #5283 is due to this race.

@blackboxsw blackboxsw added this to the cloud-init-24.3 milestone Jun 5, 2024
@aciba90 aciba90 removed the new An issue that still needs triage label Jun 7, 2024
@aciba90
Contributor Author

aciba90 commented Jun 11, 2024

Tracking in: SF#00387392.

aciba90 added a commit to aciba90/cloud-init that referenced this issue Jun 26, 2024
Those checks are going to fail due to and until canonical#5373 is fixed. Disabled
them to get a better feedback from integration tests.
TheRealFalcon added a commit to TheRealFalcon/cloud-init that referenced this issue Jul 10, 2024
It is pretty consistently failing due to canonical#5373 with no fix in
sight.
blackboxsw pushed a commit that referenced this issue Jul 10, 2024
It is pretty consistently failing due to #5373 with no fix in
sight.
@aciba90
Contributor Author

aciba90 commented Jul 11, 2024

Test disabled in #5503.

@aciba90
Contributor Author

aciba90 commented Aug 2, 2024

Per the responses in SF#00387392, there is no better way to synchronize with the IMDS than implementing a wait / retry mechanism.

Another solution, outlined by Noah in [1], would be to migrate our current implementation and configure policy-based routing (PBR) in a DHCP exit hook. The hook would have access to all the required information, but we would need to implement it for every DHCP client that cloud-init supports, or at least for the default DHCP client of each currently supported Ubuntu release:

  1. We can configure routing in a dhclient exit hook. This should have
    access to all the details that dhclient has configured on the interface
    via the environment. However, it tightens our coupling with dhclient,
    which may not be desirable. There's been some talk of moving to
    systemd-networkd for interface configuration.
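
To make the exit-hook idea concrete, here is a rough sketch of the logic such a hook could run. Real dhclient exit hooks are shell snippets sourced by dhclient-script, so this Python version is illustrative only. It assumes the environment variables documented in dhclient-script(8) (`reason`, `interface`, `new_ip_address`, `new_routers`). `pbr_commands` is a hypothetical helper, table 101 is borrowed from the test above, and the interface name and router address in the example are made up. The function only builds `ip` command lines; it does not execute them.

```python
# dhclient-script reasons after which the lease data in the environment is
# usable for configuring routes.
LEASE_REASONS = ("BOUND", "RENEW", "REBIND", "REBOOT")


def pbr_commands(env, table=101):
    """Build `ip` argv lists for source-based routing from dhclient-style env."""
    if env.get("reason") not in LEASE_REASONS:
        return []
    iface = env.get("interface")
    addr = env.get("new_ip_address")
    # new_routers is a space-separated list; use the first router.
    routers = env.get("new_routers", "").split()
    if not (iface and addr and routers):
        return []
    return [
        ["ip", "route", "replace", "default", "via", routers[0],
         "dev", iface, "table", str(table)],
        ["ip", "rule", "add", "from", addr, "table", str(table)],
    ]


# Made-up example values for the interface name and router address.
example_env = {
    "reason": "BOUND",
    "interface": "ens6",
    "new_ip_address": "192.168.3.64",
    "new_routers": "192.168.0.1",
}
for argv in pbr_commands(example_env):
    print(" ".join(argv))
```

Because the DHCP client supplies the address and router directly, this path would not depend on the IMDS having caught up, which is the main appeal of the hook approach. The cost is the per-client coupling described in the quote above.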

In summary, I see two possible solutions:

  1. Implement a wait / retry mechanism on top of the current implementation: udev rules triggering after a NIC is added and getting the info from the IMDS.
  2. Implement dhcp hooks.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=963826

@TheRealFalcon
Member

TheRealFalcon commented Aug 2, 2024

Removing this from the 24.3 milestone, as it is likely going to be addressed in the 25.04 lifecycle.

@TheRealFalcon TheRealFalcon removed this from the cloud-init-24.3 milestone Aug 2, 2024
holmanb pushed a commit to holmanb/cloud-init that referenced this issue Aug 2, 2024
It is pretty consistently failing due to canonical#5373 with no fix in
sight.
holmanb pushed a commit that referenced this issue Aug 6, 2024
It is pretty consistently failing due to #5373 with no fix in
sight.
@aciba90
Contributor Author

aciba90 commented Aug 14, 2024

Tracking in SC-1850.

@github-actions github-actions bot added the Stale label Sep 9, 2024
@aciba90 aciba90 added ds: ec2 Issues specific to DataSourceEc2 and removed Stale labels Oct 1, 2024
holmanb added a commit to holmanb/cloud-init that referenced this issue Jan 31, 2025
Make tests more robust to temporary network failure.
Document hotplug limitations.

Fixes canonicalGH-5373
@holmanb holmanb closed this as completed in d75840b Feb 3, 2025
holmanb added a commit to holmanb/cloud-init that referenced this issue Feb 11, 2025
Make tests more robust to temporary network failure.
Document hotplug limitations.

Fixes canonicalGH-5373