Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net/dhcp: handle timeouts for dhcpcd #5022

Merged
merged 2 commits into from
Mar 8, 2024
Merged

Conversation

cjp256
Copy link
Contributor

@cjp256 cjp256 commented Mar 7, 2024

I'm not sure 10 seconds is enough of a timeout. Azure will retry DHCP failures, but I expect the reduction from 300s/60s -> 10s will result in unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than relying on subprocess timeout? Something to consider.

In the meantime:

  • catch the TimeoutExpired exception and raise NoDhcpLeaseError to address below traceback

  • update the "dhclient" logs to "dhcpcd"

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds

I'm not sure 10 seconds is enough of a timeout.  Azure will retry DHCP
failures, but I expect the reduction from 300s/60s -> 10s will result in
unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than
relying on subprocess timeout?  Something to consider.

In the meantime:

- catch the TimeoutExpired exception and raise NoDhcpLeaseError to
  address below traceback

- update the "dhclient" logs to "dhcpcd"

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds
```

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
@cjp256
Copy link
Contributor Author

cjp256 commented Mar 7, 2024

linkcheck CI looks broken

@holmanb holmanb self-assigned this Mar 7, 2024
Copy link
Member

@holmanb holmanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cjp256 Thanks for this fix! We're putting together a 24.1.1 due to some feedback received on early testing, I think this makes sense for inclusion.

I'm not sure 10 seconds is enough of a timeout. Azure will retry DHCP failures, but I expect the reduction from 300s/60s -> 10s will result in unexpected failures for other clouds where there is no retry.

Oof. Thanks for correcting this behavior, I missed the TimeoutExpired exception.

I see that dhcpcd offers a --timeout option which might be better than relying on subprocess timeout? Something to consider.

I didn't feel comfortable using it due to NetworkConfiguration/dhcpcd#276. If carrier is not acquired, it would prevent retry (block indefinitely). Manually retrying via subprocess in this case would be more reliable, I think.

linkcheck CI looks broken

I'll get that fixed.

tests/unittests/net/test_dhcp.py Outdated Show resolved Hide resolved
tests/unittests/net/test_dhcp.py Show resolved Hide resolved
@cjp256
Copy link
Contributor Author

cjp256 commented Mar 7, 2024

There may be some issues with dhcpcd retry fwiw based on #5023 - will see for sure after we flight this change

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
@cjp256
Copy link
Contributor Author

cjp256 commented Mar 7, 2024

unit test failure looks unrelated:

___________________ TestDualStack.test_dual_stack_staggered ____________________

self = <tests.unittests.test_url_helper.TestDualStack object at 0x7fded11a9d30>

    def test_dual_stack_staggered(self):
        """Assert expected call intervals occur"""
        stagger = 0.1
        with mock.patch(M_PATH + "_run_func_with_delay") as delay_func:
            dual_stack(
                lambda x, _y: x,
                ["you", "and", "me", "and", "dog"],
                stagger_delay=stagger,
                timeout=1,
            )
    
            # ensure that stagger delay for each subsequent call is:
            # [ 0 * N, 1 * N, 2 * N, 3 * N, 4 * N, 5 * N] where N = stagger
            # it appears that without an explicit wait/join we can't assert
            # number of calls
            for delay, call_item in enumerate(delay_func.call_args_list):
                _, kwargs = call_item
>               assert stagger * delay == kwargs.get("delay")
E               AssertionError: assert (0.1 * 1) == 0.2
E                +  where 0.2 = <built-in method get of dict object at 0x7fde941d7500>('delay')
E                +    where <built-in method get of dict object at 0x7fde941d7500> = {'addr': 'me', 'delay': 0.2, 'event': <threading.Event at 0x7fde841eb680: set>, 'func': <function TestDualStack.test_dual_stack_staggered.<locals>.<lambda> at 0x7fde84248f40>, ...}.get

_          = ()
call_item  = call(func=<function TestDualStack.test_dual_stack_staggered.<locals>.<lambda> at 0x7fde84248f40>, addr='me', timeout=1, event=<threading.Event at 0x7fde841eb680: set>, delay=0.2)
delay      = 1
delay_func = <MagicMock name='_run_func_with_delay' id='140593676830480'>
kwargs     = {'addr': 'me',
 'delay': 0.2,
 'event': <threading.Event at 0x7fde841eb680: set>,
 'func': <function TestDualStack.test_dual_stack_staggered.<locals>.<lambda> at 0x7fde84248f40>,
 'timeout': 1}
self       = <tests.unittests.test_url_helper.TestDualStack object at 0x7fded11a9d30>
stagger    = 0.1

tests/unittests/test_url_helper.py:461: AssertionError

@holmanb holmanb added the 24.1 label Mar 8, 2024
Copy link
Member

@holmanb holmanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for this fix!

@holmanb holmanb merged commit faeca64 into canonical:main Mar 8, 2024
29 checks passed
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Mar 9, 2024
I'm not sure 10 seconds is enough of a timeout.  Azure will retry DHCP
failures, but I expect the reduction from 300s/60s -> 10s will result in
unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than
relying on subprocess timeout?  Something to consider.

In the meantime:

- catch the TimeoutExpired exception and raise NoDhcpLeaseError to
  address below traceback

- update the "dhclient" logs to "dhcpcd"

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds
```

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Mar 9, 2024
I'm not sure 10 seconds is enough of a timeout.  Azure will retry DHCP
failures, but I expect the reduction from 300s/60s -> 10s will result in
unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than
relying on subprocess timeout?  Something to consider.

In the meantime:

- catch the TimeoutExpired exception and raise NoDhcpLeaseError to
  address below traceback

- update the "dhclient" logs to "dhcpcd"

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds
```

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Mar 9, 2024
I'm not sure 10 seconds is enough of a timeout.  Azure will retry DHCP
failures, but I expect the reduction from 300s/60s -> 10s will result in
unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than
relying on subprocess timeout?  Something to consider.

In the meantime:

- catch the TimeoutExpired exception and raise NoDhcpLeaseError to
  address below traceback

- update the "dhclient" logs to "dhcpcd"

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds
```

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
@holmanb holmanb mentioned this pull request Mar 9, 2024
1 task
holmanb pushed a commit that referenced this pull request Mar 11, 2024
I'm not sure 10 seconds is enough of a timeout.  Azure will retry DHCP
failures, but I expect the reduction from 300s/60s -> 10s will result in
unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than
relying on subprocess timeout?  Something to consider.

In the meantime:

- catch the TimeoutExpired exception and raise NoDhcpLeaseError to
  address below traceback

- update the "dhclient" logs to "dhcpcd"

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds
```

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Mar 12, 2024
I'm not sure 10 seconds is enough of a timeout.  Azure will retry DHCP
failures, but I expect the reduction from 300s/60s -> 10s will result in
unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than
relying on subprocess timeout?  Something to consider.

In the meantime:

- catch the TimeoutExpired exception and raise NoDhcpLeaseError to
  address below traceback

- update the "dhclient" logs to "dhcpcd"

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds
```

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
holmanb pushed a commit that referenced this pull request Mar 12, 2024
I'm not sure 10 seconds is enough of a timeout.  Azure will retry DHCP
failures, but I expect the reduction from 300s/60s -> 10s will result in
unexpected failures for other clouds where there is no retry.

I see that dhcpcd offers a --timeout option which might be better than
relying on subprocess timeout?  Something to consider.

In the meantime:

- catch the TimeoutExpired exception and raise NoDhcpLeaseError to
  address below traceback

- update the "dhclient" logs to "dhcpcd"

```
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 851, in _get_data
    crawled_data = util.log_time(
                   ^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/util.py", line 2825, in log_time
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 630, in crawl_metadata
    self._setup_ephemeral_networking(timeout_minutes=timeout_minutes)
  File "/usr/lib/python3/dist-packages/cloudinit/sources/helpers/azure.py", line 45, in impl
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/sources/DataSourceAzure.py", line 440, in _setup_ephemeral_networking
    lease = self._ephemeral_dhcp_ctx.obtain_lease()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/ephemeral.py", line 288, in obtain_lease
    self.lease = maybe_perform_dhcp_discovery(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 102, in maybe_perform_dhcp_discovery
    return distro.dhcp_client.dhcp_discovery(interface, dhcp_log_func, distro)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/net/dhcp.py", line 649, in dhcp_discovery
    out, err = subp.subp(
               ^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/cloudinit/subp.py", line 276, in subp
    out, err = sp.communicate(data, timeout=timeout)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1209, in communicate
    stdout, stderr = self._communicate(input, endtime, timeout)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2114, in _communicate
    self._check_timeout(endtime, orig_timeout, stdout, stderr)
  File "/usr/lib/python3.12/subprocess.py", line 1253, in _check_timeout
    raise TimeoutExpired(
subprocess.TimeoutExpired: Command '[b'/usr/sbin/dhcpcd', b'--ipv4only', b'--waitip', b'--persistent', b'--noarp', b'--script=/bin/true', b'eth0']' timed out after 10 seconds
```

Signed-off-by: Chris Patterson <cpatterson@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants