Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignoring duplicate macs for calico network #4975

Closed
wants to merge 1 commit into from
Closed

Ignoring duplicate macs for calico network #4975

wants to merge 1 commit into from

Conversation

xiaoge1001
Copy link
Contributor

@xiaoge1001 xiaoge1001 commented Feb 28, 2024

test_net failed in the calico network env. error msg is:

                 msg = "duplicate mac found! both '%s' and '%s' have mac '%s'." % (
                     name,
                     ret[mac],
                     mac,
                 )
 >               raise RuntimeError(msg)
 E               RuntimeError: duplicate mac found! both 'caliaadab05bca1' and 'cali7a099f8b737' have mac 'ee:ee:ee:ee:ee:ee'.

 cloudinit/net/__init__.py:987: RuntimeError

Checking the execution environment network interfaces information, it was found that multiple network interfaces's MAC addresses are ee:ee:ee:ee:ee:ee. eg:

 4789: calie4685cb71c1@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
     link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 5
     inet6 fe80::ecee:eeff:feee:eeee/64 scope link
        valid_lft forever preferred_lft forever
 4799: calib73b89e79e5@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
     link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 8
     inet6 fe80::ecee:eeff:feee:eeee/64 scope link
         valid_lft forever preferred_lft forever

Q:Why do all cali* interfaces have the MAC address ee:ee:ee:ee:ee:ee?
In some setups the kernel is unable to generate a persistent MAC address and so Calico assigns a MAC address itself. Since Calico uses point-to-point routed interfaces, traffic does not reach the data link layer so the MAC Address is never used and can therefore be the same for all the cali* interfaces. Reference:https://docs.tigera.io/calico/latest/reference/faq#why-do-all-cali-interfaces-have-the-mac-address-eeeeeeeeeeee

Fixes GH-4973

test_net failed in  the calico network env. error msg is:
                 msg = "duplicate mac found! both '%s' and '%s' have mac '%s'." % (
                     name,
                     ret[mac],
                     mac,
                 )
 >               raise RuntimeError(msg)
 E               RuntimeError: duplicate mac found! both 'caliaadab05bca1' and 'cali7a099f8b737' have mac 'ee:ee:ee:ee:ee:ee'.

 cloudinit/net/__init__.py:987: RuntimeError

 Checking the execution environment network interfaces information, it was found that multiple network interfaces's MAC addresses are ee:ee:ee:ee:ee:ee. eg:

 4789: calie4685cb71c1@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
     link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 5
     inet6 fe80::ecee:eeff:feee:eeee/64 scope link
        valid_lft forever preferred_lft forever
 4799: calib73b89e79e5@if10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1440 qdisc noqueue state UP group default
     link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netnsid 8
     inet6 fe80::ecee:eeff:feee:eeee/64 scope link
         valid_lft forever preferred_lft forever

 Q:Why do all cali* interfaces have the MAC address ee:ee:ee:ee:ee:ee?
 In some setups the kernel is unable to generate a persistent MAC address and so Calico assigns a MAC address itself. Since Calico uses point-to-point routed interfaces, traffic does not reach the data link layer so the MAC Address is never used and can therefore be the same for all the cali* interfaces. Reference:https://docs.tigera.io/calico/latest/reference/faq#why-do-all-cali-interfaces-have-the-mac-address-eeeeeeeeeeee
@catmsred
Copy link
Collaborator

Thank you for digging into the context of this issue and proposing a fix! It looks good to me.

@@ -984,6 +984,15 @@ def get_interfaces_by_mac_on_linux() -> dict:
)
continue

if name.startswith("cali") and mac == "ee:ee:ee:ee:ee:ee":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an associated network driver here unique to calico devices which we can check to align with other approaches just above this segment? Ideally we don't have to parse device name and specific mac match here.

We'd generally be able to find that driver via ls -l /sys/class/net/<calico_dev_name>/device/driver.

Also can we please provide an inline comment to link referencing: https://docs.tigera.io/calico/latest/reference/faq#why-do-all-cali-interfaces-have-the-mac-address-eeeeeeeeeeee for future us. when understanding the context of calico device behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an associated network driver here unique to calico devices which we can check to align with other approaches just above this segment? Ideally we don't have to parse device name and specific mac match here.

We'd generally be able to find that driver via ls -l /sys/class/net/<calico_dev_name>/device/driver.

What is wrong with the current approach? Because I can't get the environment which the problem occurs. It's a CI environment and I don't have permission to log in. I added the print code to the cloud-init source and then obtained the information from the public build log.

Also can we please provide an inline comment to link referencing: https://docs.tigera.io/calico/latest/reference/faq#why-do-all-cali-interfaces-have-the-mac-address-eeeeeeeeeeee for future us. when understanding the context of calico device behavior?

The URL is too long. If I add it to a comment or note, the CI may fail.

@blackboxsw blackboxsw self-assigned this Feb 28, 2024
@blackboxsw
Copy link
Collaborator

Generally speaking it'd also be good to extend our existing unittests here to assert the behavior and logs we expect. You can find the relevant unit test at tests/unittests/test_net.py::TestDuplicateMac::test_duplicate_ignored_macs

Copy link
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking minimally to have unittest coverage of this feature. Hopefully, we can key on the network driver name to use existing logic based on driver to ignore duplicate macs.

@holmanb
Copy link
Member

holmanb commented Feb 29, 2024

Hopefully, we can key on the network driver name to use existing logic based on driver to ignore duplicate macs.

Since calico is a network overlay, this isn't tied to any specific hardware driver. I assume that it uses veths, but wouldn't be surprised if it supports multiple virtual network devices.

@xiaoge1001
Copy link
Contributor Author

Generally speaking it'd also be good to extend our existing unittests here to assert the behavior and logs we expect. You can find the relevant unit test at tests/unittests/test_net.py::TestDuplicateMac::test_duplicate_ignored_macs

Sorry, I can't help with adding tests. I'm not familiar with mock and pytest modules. I only can use pytest to execute test cases and don't have the ability to write test cases using pytest and mock.

@TheRealFalcon
Copy link
Member

It's a CI environment and I don't have permission to log in. I added the print code to the cloud-init source and then obtained the information from the public build log.

We're happy to support an environment that cloud-init needs to run within, but is there a use case for adding this? If the problem is a unit test failure within a CI environment, then the root problem should be fixed by ensuring we have all the proper mocks in our unit tests, not changing the production code. If there is no actual production use case, then I'm -1 on this change.

@xiaoge1001
Copy link
Contributor Author

It's a CI environment and I don't have permission to log in. I added the print code to the cloud-init source and then obtained the information from the public build log.

We're happy to support an environment that cloud-init needs to run within, but is there a use case for adding this? If the problem is a unit test failure within a CI environment, then the root problem should be fixed by ensuring we have all the proper mocks in our unit tests, not changing the production code. If there is no actual production use case, then I'm -1 on this change.

But is it possible to have such actual production? I don't know much about it.

@blackboxsw
Copy link
Collaborator

Since this use-case really looks to be more of a test environment/setup than production use case at the moment, let's look to fix the leaked mocks to get you up and running. Basically our problem seems to be that a number of unittests and leaking calls to cloudinit.net.get_interfaces_by_mac which is breaking unittests in test environments where duplicate macs exist.
The following catches a bunch of those mocks to make sure our unittests aren't calling out to the host system and getting values that the unittests don't really care about.

While I think this approach is the right direction, I didn't complete this suggestion here because we have an open PR #4840 that is refactoring the test_net module to make the use of fixtures easier. So, for the moment, let's get #4840 reviewed merged, then we can look to appropriately mock the leaking calls to avoid the log failures you are seeing on openeuler test harness.

diff --git a/tests/unittests/conftest.py b/tests/unittests/conftest.py
index ed543deec..b95d76cf0 100644
--- a/tests/unittests/conftest.py
+++ b/tests/unittests/conftest.py
@@ -77,6 +77,20 @@ def disable_dns_lookup(request):
         yield
 
 
+@pytest.fixture()
+def get_interfaces_by_mac():
+    """
+    Avoid leaking calls to get_interfaces_by_mac to avoid potential unsupported
+    test environments with duplicate MAC addresses: GH-4973
+    """
+    with mock.patch(
+        "cloudinit.net.network_state.get_interfaces_by_mac",
+        return_value={},
+        autospec=True,
+    ):
+        yield
+
+
 @pytest.fixture()
 def dhclient_exists():
     with mock.patch(
diff --git a/tests/unittests/distros/test_netconfig.py b/tests/unittests/distros/test_netconfig.py
index 48690d712..ba30d112c 100644
--- a/tests/unittests/distros/test_netconfig.py
+++ b/tests/unittests/distros/test_netconfig.py
@@ -703,10 +703,14 @@ class TestNetCfgDistroRedhat(TestNetCfgDistroBase):
             raise ValueError("expected_cfg must not be None")
 
         tmpd = None
-        with mock.patch("cloudinit.net.sysconfig.available") as m_avail:
-            m_avail.return_value = True
-            with self.reRooted(tmpd) as tmpd:
-                apply_fn(config, bringup)
+        with mock.patch(
+            "cloudinit.net.network_state.get_interfaces_by_mac",
+            return_value={},
+        ):
+            with mock.patch("cloudinit.net.sysconfig.available") as m_avail:
+                m_avail.return_value = True
+                with self.reRooted(tmpd) as tmpd:
+                    apply_fn(config, bringup)
 
         results = dir2dict(tmpd)
         for cfgpath, expected in expected_cfgs.items():
@@ -1073,10 +1077,14 @@ class TestNetCfgDistroPhoton(TestNetCfgDistroBase):
             raise ValueError("expected_cfg must not be None")
 
         tmpd = None
-        with mock.patch("cloudinit.net.networkd.available") as m_avail:
-            m_avail.return_value = True
-            with self.reRooted(tmpd) as tmpd:
-                apply_fn(config, bringup)
+        with mock.patch(
+            "cloudinit.net.network_state.get_interfaces_by_mac",
+            return_value={},
+        ):
+            with mock.patch("cloudinit.net.networkd.available") as m_avail:
+                m_avail.return_value = True
+                with self.reRooted(tmpd) as tmpd:
+                    apply_fn(config, bringup)
 
         results = dir2dict(tmpd)
         for cfgpath, expected in expected_cfgs.items():
@@ -1198,10 +1206,14 @@ class TestNetCfgDistroMariner(TestNetCfgDistroBase):
             raise ValueError("expected_cfg must not be None")
 
         tmpd = None
-        with mock.patch("cloudinit.net.networkd.available") as m_avail:
-            m_avail.return_value = True
-            with self.reRooted(tmpd) as tmpd:
-                apply_fn(config, bringup)
+        with mock.patch(
+            "cloudinit.net.network_state.get_interfaces_by_mac",
+            return_value={},
+        ):
+            with mock.patch("cloudinit.net.networkd.available") as m_avail:
+                m_avail.return_value = True
+                with self.reRooted(tmpd) as tmpd:
+                    apply_fn(config, bringup)
 
         results = dir2dict(tmpd)
         for cfgpath, expected in expected_cfgs.items():
diff --git a/tests/unittests/net/test_network_state.py b/tests/unittests/net/test_network_state.py
index d2063231d..109a374d2 100644
--- a/tests/unittests/net/test_network_state.py
+++ b/tests/unittests/net/test_network_state.py
@@ -101,9 +101,16 @@ class TestNetworkStateParseConfig(CiTestCase):
         self.assertNotEqual(None, result)
 
 
-@mock.patch("cloudinit.net.network_state.get_interfaces_by_mac")
+@mock.patch(
+    "cloudinit.net.network_state.find_interface_name_from_mac", return_value={}
+)
+@mock.patch(
+    "cloudinit.net.network_state.get_interfaces_by_mac", return_value={}
+)
 class TestNetworkStateParseConfigV2:
-    def test_version_2_ignores_renderer_key(self, m_get_interfaces_by_mac):
+    def test_version_2_ignores_renderer_key(
+        self, m_get_interfaces_by_mac, m_find_interface_name_from_mac
+    ):
         ncfg = {"version": 2, "renderer": "networkd", "ethernets": {}}
         nsi = network_state.NetworkStateInterpreter(
             version=ncfg["version"], config=ncfg
@@ -205,7 +212,12 @@ class TestNetworkStateParseConfigV2:
         ],
     )
     def test_v2_warns_deprecated_gateways(
-        self, m_get_interfaces_by_mac, renderer_cls, cfg: str, caplog
+        self,
+        m_get_interfaces_by_mac,
+        m_find_interface_name_from_mac,
+        renderer_cls,
+        cfg: str,
+        caplog,
     ):
         """
         Tests that a v2 netconf with the deprecated `gateway4` or `gateway6`
diff --git a/tests/unittests/runs/test_simple_run.py b/tests/unittests/runs/test_simple_run.py
index eec2db00b..51f3fa880 100644
--- a/tests/unittests/runs/test_simple_run.py
+++ b/tests/unittests/runs/test_simple_run.py
@@ -2,6 +2,7 @@
 
 import copy
 import os
+from unittest import mock
 
 from cloudinit import atomic_helper, safeyaml, stages, util
 from cloudinit.config.modules import Modules
@@ -74,7 +75,11 @@ class TestSimpleRun(helpers.FilesystemMockingTestCase):
         self.assertFalse(
             os.path.exists("/var/lib/cloud/instance/network-config.json")
         )
-        initer.apply_network_config(False)
+        with mock.patch(
+            "cloudinit.net.network_state.get_interfaces_by_mac",
+            return_value={},
+        ):
+            initer.apply_network_config(False)
         self.assertEqual(
             f"{atomic_helper.json_dumps(netcfg)}\n",
             util.load_text_file("/var/lib/cloud/instance/network-config.json"),
diff --git a/tests/unittests/sources/test_oracle.py b/tests/unittests/sources/test_oracle.py
index 0f0d90118..da477316c 100644
--- a/tests/unittests/sources/test_oracle.py
+++ b/tests/unittests/sources/test_oracle.py
@@ -220,7 +220,10 @@ class TestIsPlatformViable:
     mock.Mock(return_value=False),
 )
 class TestNetworkConfigFromOpcImds:
-    def test_no_secondary_nics_does_not_mutate_input(self, oracle_ds):
+    @mock.patch(DS_PATH + ".get_interfaces_by_mac", return_value={})
+    def test_no_secondary_nics_does_not_mutate_input(
+        self, m_get_interfaces_by_mac, oracle_ds
+    ):
         oracle_ds._vnics_data = [{}]
         # We test this by using in a non-dict to ensure that no dict
         # operations are used; failure would be seen as exceptions
@@ -322,7 +325,10 @@ class TestNetworkConfigFromOpcImds:
         "set_primary",
         [True, False],
     )
-    def test_secondary_nic_v2(self, set_primary, oracle_ds):
+    @mock.patch(DS_PATH + ".get_interfaces_by_mac", return_value={})
+    def test_secondary_nic_v2(
+        self, m_get_interfaces_by_mac, set_primary, oracle_ds
+    ):
         oracle_ds._vnics_data = json.loads(OPC_VM_SECONDARY_VNIC_RESPONSE)
         oracle_ds._network_config = {
             "version": 2,
diff --git a/tests/unittests/test_net.py b/tests/unittests/test_net.py
index cb9919388..0ab028766 100644
--- a/tests/unittests/test_net.py
+++ b/tests/unittests/test_net.py
@@ -4842,8 +4842,12 @@ class TestGenerateFallbackConfig(CiTestCase):
     @mock.patch("cloudinit.net.sys_dev_path")
     @mock.patch("cloudinit.net.read_sys_net")
     @mock.patch("cloudinit.net.get_devicelist")
+    @mock.patch(
+        "cloudinit.net.network_state.get_interfaces_by_mac", return_value={}
+    )
     def test_device_driver(
         self,
+        mock_get_interfaces_by_mac,
         mock_get_devicelist,
         mock_read_sys_net,
         mock_sys_dev_path,
@@ -4934,8 +4938,12 @@ iface eth0 inet6 dhcp
     @mock.patch("cloudinit.net.sys_dev_path")
     @mock.patch("cloudinit.net.read_sys_net")
     @mock.patch("cloudinit.net.get_devicelist")
+    @mock.patch(
+        "cloudinit.net.network_state.get_interfaces_by_mac", return_value={}
+    )
     def test_hv_netvsc_vf_filter(
         self,
+        mock_get_interfaces_by_mac,
         mock_get_devicelist,
         mock_read_sys_net,
         mock_sys_dev_path,
@@ -5144,7 +5152,11 @@ class TestRhelSysConfigRendering(CiTestCase):
             dir = self.tmp_dir()
 
         if network_config:
-            ns = network_state.parse_net_config_data(network_config)
+            with mock.patch(
+                "cloudinit.net.network_state.get_interfaces_by_mac",
+                return_value={},
+            ):
+                ns = network_state.parse_net_config_data(network_config)
         elif state:
             ns = state
         else:
@@ -5198,8 +5210,12 @@ class TestRhelSysConfigRendering(CiTestCase):
     @mock.patch("cloudinit.net.sys_dev_path")
     @mock.patch("cloudinit.net.read_sys_net")
     @mock.patch("cloudinit.net.get_devicelist")
+    @mock.patch(
+        "cloudinit.net.network_state.get_interfaces_by_mac", return_value={}
+    )
     def test_default_generation(
         self,
+        mock_get_interfaces_by_mac,
         mock_get_devicelist,
         mock_read_sys_net,
         mock_sys_dev_path,
@@ -6081,7 +6097,11 @@ class TestOpenSuseSysConfigRendering(CiTestCase):
             dir = self.tmp_dir()
 
         if network_config:
-            ns = network_state.parse_net_config_data(network_config)
+            with mock.patch(
+                "cloudinit.net.network_state.get_interfaces_by_mac",
+                return_value={},
+            ):
+                ns = network_state.parse_net_config_data(network_config)
         elif state:
             ns = state
         else:
@@ -6477,7 +6497,11 @@ class TestNetworkManagerRendering(CiTestCase):
             dir = self.tmp_dir()
 
         if network_config:
-            ns = network_state.parse_net_config_data(network_config)
+            with mock.patch(
+                "cloudinit.net.network_state.get_interfaces_by_mac",
+                return_value={},
+            ):
+                ns = network_state.parse_net_config_data(network_config)
         elif state:
             ns = state
         else:
@@ -7838,7 +7862,11 @@ class TestNetplanRoundTrip(CiTestCase):
             target = self.tmp_dir()
 
         if network_config:
-            ns = network_state.parse_net_config_data(network_config)
+            with mock.patch(
+                "cloudinit.net.network_state.get_interfaces_by_mac",
+                return_value={},
+            ):
+                ns = network_state.parse_net_config_data(network_config)
         elif state:
             ns = state
         else:
@@ -8066,7 +8094,11 @@ class TestEniRoundTrip(CiTestCase):
             dir = self.tmp_dir()
 
         if network_config:
-            ns = network_state.parse_net_config_data(network_config)
+            with mock.patch(
+                "cloudinit.net.network_state.get_interfaces_by_mac",
+                return_value={},
+            ):
+                ns = network_state.parse_net_config_data(network_config)
         elif state:
             ns = state
         else:
@@ -8491,7 +8523,11 @@ class TestNetworkdRoundTrip(CiTestCase):
             dir = self.tmp_dir()
 
         if network_config:
-            ns = network_state.parse_net_config_data(network_config)
+            with mock.patch(
+                "cloudinit.net.network_state.get_interfaces_by_mac",
+                return_value={},
+            ):
+                ns = network_state.parse_net_config_data(network_config)
         elif state:
             ns = state
         else:
diff --git a/tests/unittests/test_net_activators.py b/tests/unittests/test_net_activators.py
index d53701efa..9759b7790 100644
--- a/tests/unittests/test_net_activators.py
+++ b/tests/unittests/test_net_activators.py
@@ -281,7 +281,12 @@ class TestActivatorsBringUp:
 
     @patch("cloudinit.subp.subp", return_value=("", ""))
     def test_bring_up_all_interfaces_v2(
-        self, m_subp, activator, expected_call_list, available_mocks
+        self,
+        m_subp,
+        activator,
+        expected_call_list,
+        available_mocks,
+        get_interfaces_by_mac,
     ):
         network_state = parse_net_config_data(load(V2_CONFIG))
         activator.bring_up_all_interfaces(network_state)

@blackboxsw
Copy link
Collaborator

Upstream has just merged #5012 which we believe should solve the duplicate MAC issue you were seeing in unittest runs. If this issue persists, please reopen issue #4973

@blackboxsw blackboxsw closed this Mar 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

test_net test fail in the calico network env
5 participants