The ephemeral network doesn't work for IPv6 only network with RHEL on AWS #4540

Closed
xiachen-rh opened this issue Oct 23, 2023 · 3 comments
Labels
bug Something isn't working correctly priority Fix soon

Comments

@xiachen-rh
Contributor

Bug report

Launching a CentOS Stream/RHEL machine in AWS into an IPv6-only subnet results in an unusable instance due to missing IPv4 routes.
[Root Cause]
The ephemeral network doesn't work on an IPv6-only network with CentOS Stream/RHEL.
Because IPv4 is not usable on an IPv6-only network, cloud-init raises a ProcessExecutionError when EphemeralDHCPv4 calls _bringup_static_routes of EphemeralIPv4Network. As a result, crawl_metadata is skipped and cloud-init ends up with a DataSourceNotFoundException.
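
The failure path described above can be illustrated with a minimal sketch. It uses stand-ins rather than cloud-init's real classes: `FakeEphemeralIPv4` and `get_data` are hypothetical names for this example, and `RuntimeError` stands in for `ProcessExecutionError`. The point is only that an exception raised while entering a context manager prevents the body of the `with` block (where the metadata crawl would happen) from running at all.

```python
class FakeEphemeralIPv4:
    """Hypothetical stand-in for an ephemeral IPv4 context manager."""

    def __enter__(self):
        # Simulates `ip -4 route append ...` failing on an IPv6-only subnet.
        raise RuntimeError("Error: Nexthop has invalid gateway.")

    def __exit__(self, exc_type, exc, tb):
        return False


def get_data():
    """Return crawled metadata, or None if network setup failed."""
    crawled = []
    try:
        with FakeEphemeralIPv4():
            # Never reached: __enter__ raised before the body could run.
            crawled.append("metadata")
    except RuntimeError:
        # No network, so no metadata: the datasource is reported as not found.
        return None
    return crawled


print(get_data())
```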

Issue filed with Red Hat: https://issues.redhat.com/browse/RHEL-7278
[RHEL-8]Launching EC2 Instance in IPv6-Only subnet leads to unreachable instance

Steps to reproduce the problem

  1. Create a VPC with an IPv6 CIDR block (using either your own or Amazon's IPv6 address space).
  2. Create an IPv6-only subnet by creating a new subnet and checking the "IPv6 Only" box.
  3. Create an instance and associate it with the IPv6-capable VPC and the IPv6-only subnet.
    aws --region *** ec2 run-instances --instance-type *** --image-id *** --subnet-id subnet-*** --key-name *** --ipv6-address-count 1 --security-group-ids *** --metadata-options "HttpEndpoint=enabled,HttpProtocolIpv6=enabled" --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=rhel8-ipv6only}]'
  4. After approximately 10 minutes, the instance will complete the boot process, but will have "1/2 checks passed" in the "Status Check" column, and "Instance reachability check failed" in the "Status Check" tab of the instance details section.

Environment details

cloud-init logs

2023-10-17 08:05:30,204 - ephemeral.py[DEBUG]: Attempting setup of ephemeral network on eth0 with 169.254.255.155/32 brd 0.0.0.0
2023-10-17 08:05:30,204 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'add', '169.254.255.155/32', 'broadcast', '0.0.0.0', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-10-17 08:05:30,208 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'up'] with allowed return codes [0] (shell=False, capture=True)
2023-10-17 08:05:30,213 - subp.py[DEBUG]: Running command ['ip', '-4', 'route', 'append', '169.254.169.253/32', 'via', '169.254.0.1', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-10-17 08:05:30,217 - ephemeral.py[ERROR]: Error bringing up EphemeralIPv4Network. Datasource setup cannot continue
2023-10-17 08:05:30,217 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'link', 'set', 'dev', 'eth0', 'down'] with allowed return codes [0] (shell=False, capture=True)
2023-10-17 08:05:30,221 - subp.py[DEBUG]: Running command ['ip', '-family', 'inet', 'addr', 'del', '169.254.255.155/32', 'dev', 'eth0'] with allowed return codes [0] (shell=False, capture=True)
2023-10-17 08:05:30,225 - handlers.py[DEBUG]: finish: init-local/search-Ec2Local: FAIL: no local data found from DataSourceEc2Local
2023-10-17 08:05:30,225 - util.py[WARNING]: Getting data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2Local'> failed
2023-10-17 08:05:30,226 - util.py[DEBUG]: Getting data from <class 'cloudinit.sources.DataSourceEc2.DataSourceEc2Local'> failed
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/cloudinit/sources/__init__.py", line 999, in find_source
    [EventType.BOOT_NEW_INSTANCE]
  File "/usr/lib/python3.6/site-packages/cloudinit/sources/__init__.py", line 880, in update_metadata_if_supported
    result = self.get_data()
  File "/usr/lib/python3.6/site-packages/cloudinit/sources/DataSourceEc2.py", line 722, in get_data
    return super(DataSourceEc2Local, self).get_data()
  File "/usr/lib/python3.6/site-packages/cloudinit/sources/__init__.py", line 422, in get_data
    return_value = self._check_and_get_data()
  File "/usr/lib/python3.6/site-packages/cloudinit/sources/__init__.py", line 357, in _check_and_get_data
    return self._get_data()
  File "/usr/lib/python3.6/site-packages/cloudinit/sources/DataSourceEc2.py", line 137, in _get_data
    tmp_dir=self.distro.get_tmp_exec_path(),
  File "/usr/lib/python3.6/site-packages/cloudinit/net/ephemeral.py", line 470, in __enter__
    EphemeralDHCPv4(self.interface, tmp_dir=self.tmp_dir)
  File "/usr/lib64/python3.6/contextlib.py", line 330, in enter_context
    result = _cm_type.__enter__(cm)
  File "/usr/lib/python3.6/site-packages/cloudinit/net/ephemeral.py", line 364, in __enter__
    return self.obtain_lease()
  File "/usr/lib/python3.6/site-packages/cloudinit/net/ephemeral.py", line 424, in obtain_lease
    ephipv4.__enter__()
  File "/usr/lib/python3.6/site-packages/cloudinit/net/ephemeral.py", line 120, in __enter__
    self._bringup_static_routes()
  File "/usr/lib/python3.6/site-packages/cloudinit/net/ephemeral.py", line 237, in _bringup_static_routes
    capture=True,
  File "/usr/lib/python3.6/site-packages/cloudinit/subp.py", line 336, in subp
    stdout=out, stderr=err, exit_code=rc, cmd=args
cloudinit.subp.ProcessExecutionError: Unexpected error while running command.
Command: ['ip', '-4', 'route', 'append', '169.254.169.253/32', 'via', '169.254.0.1', 'dev', 'eth0']
Exit code: 2
Reason: -
Stdout:
Stderr: Error: Nexthop has invalid gateway.
2023-10-17 08:05:30,232 - main.py[DEBUG]: No local datasource found

@xiachen-rh xiachen-rh added bug Something isn't working correctly new An issue that still needs triage labels Oct 23, 2023
@xiachen-rh
Contributor Author

xiachen-rh commented Oct 23, 2023

I have two ideas, and I have tested that both of them work.

  1. Add self.stack.enter_context(EphemeralIPv6Network(self.interface)) when handling the exception.
class EphemeralIPv4Network:
    def __enter__(self):
        # ... (start of the method omitted in this excerpt)
        try:
            # ... (earlier setup steps omitted in this excerpt)
            if self.static_routes:
                self._bringup_static_routes()
            elif self.router:
                self._bringup_router()
        except subp.ProcessExecutionError:
            LOG.error(
                "Error bringing up EphemeralIPv4Network. "
                "Datasource setup cannot continue with IPv4"  # <--- change the log
            )
            self.__exit__(None, None, None)
            raise
class EphemeralIPNetwork:
    def __enter__(self):
        # ipv6 dualstack might succeed when dhcp4 fails
        # therefore catch exception unless only v4 is used
        try:
            if self.ipv4:
                self.stack.enter_context(
                    EphemeralDHCPv4(self.interface, tmp_dir=self.tmp_dir)
                )
            #if self.ipv6:
                #self.stack.enter_context(EphemeralIPv6Network(self.interface))
        # v6 link local might be usable
        # caller may want to log network state
        except (NoDHCPLeaseError, subp.ProcessExecutionError) as e:  # <--- catch ProcessExecutionError
            if self.ipv6:
                self.state_msg = "using link-local ipv6"
                self.stack.enter_context(EphemeralIPv6Network(self.interface))  # <--- add this line
            else:
                raise e
        return self
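
The fallback in the first idea can be tried out in isolation with a minimal sketch. Everything here is a stand-in: `SetupError`, `FakeV4`, `FakeV6`, and `FakeIPNetwork` are hypothetical names, not cloud-init classes, and `SetupError` plays the role of both `NoDHCPLeaseError` and `ProcessExecutionError`. The sketch only shows the control flow: if IPv4 setup raises and IPv6 is allowed, the IPv6 context is entered instead; otherwise the error propagates.

```python
from contextlib import ExitStack


class SetupError(Exception):
    """Stand-in for NoDHCPLeaseError / ProcessExecutionError."""


class FakeV4:
    """Hypothetical IPv4 context that can be told to fail on entry."""

    def __init__(self, fail):
        self.fail = fail

    def __enter__(self):
        if self.fail:
            raise SetupError("Nexthop has invalid gateway")
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


class FakeV6:
    """Hypothetical IPv6 link-local context that always succeeds."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


class FakeIPNetwork:
    def __init__(self, ipv4=True, ipv6=True, v4_fails=False):
        self.ipv4, self.ipv6, self.v4_fails = ipv4, ipv6, v4_fails
        self.stack = ExitStack()
        self.state_msg = ""

    def __enter__(self):
        try:
            if self.ipv4:
                self.stack.enter_context(FakeV4(self.v4_fails))
                self.state_msg = "using ipv4"
        except SetupError:
            if not self.ipv6:
                raise  # IPv4-only caller: nothing to fall back to
            self.state_msg = "using link-local ipv6"
            self.stack.enter_context(FakeV6())
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stack.close()  # unwinds whichever contexts were entered
        return False


with FakeIPNetwork(v4_fails=True) as net:
    print(net.state_msg)
```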

Or
2. Remove the two lines that call __exit__ and re-raise the exception in /cloudinit/net/ephemeral.py. Without the exception, stack.close() can still do the cleanup.

class EphemeralIPv4Network:
    def __enter__(self):
        # ... (start of the method omitted in this excerpt)
        try:
            # ... (earlier setup steps omitted in this excerpt)
            if self.static_routes:
                self._bringup_static_routes()
            elif self.router:
                self._bringup_router()
        except subp.ProcessExecutionError:
            LOG.error(
                "Error bringing up EphemeralIPv4Network. "
                "Datasource setup cannot continue with IPv4"  # <--- change the log
            )
            #self.__exit__(None, None, None)  # <--- remove these two lines
            #raise
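
The second idea relies on `contextlib.ExitStack` running the registered `__exit__` when the stack is closed. A minimal sketch of that behaviour follows; `QuietV4` and the `events` list are hypothetical names for illustration, and `OSError` stands in for `ProcessExecutionError`.

```python
from contextlib import ExitStack

events = []


class QuietV4:
    """Hypothetical context manager that swallows its own setup failure."""

    def __enter__(self):
        try:
            raise OSError("route append failed")  # simulated setup failure
        except OSError:
            # Record the failure but do not clean up or re-raise here.
            events.append("setup failed, not re-raised")
        return self

    def __exit__(self, exc_type, exc, tb):
        events.append("cleanup")
        return False


stack = ExitStack()
stack.enter_context(QuietV4())  # succeeds, so __exit__ is registered
events.append("caller continues")
stack.close()                   # runs the registered __exit__
print(events)
```

One consequence visible in the sketch is that the caller receives no signal that setup failed, so it would have to detect the missing IPv4 connectivity some other way.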

Or do you have any better solution?

@blackboxsw
Collaborator

Thank you for reporting this bug upstream and making cloud-init better. I agree: from the log, this does look like an issue in upstream handling of ephemeral IPv4 network setup in IPv6-only environments. Our integration tests only validate that IPv6 support works on a dual-stack VM deployed in EC2, with iptables rejecting the available IPv4 traffic to force fallback to IPv6. I think this is something we need to address, and the approach to take probably needs a bit more investigation. Adding the priority label so we can work on this and fix coverage for IPv6-only deployments.

@blackboxsw blackboxsw added priority Fix soon and removed new An issue that still needs triage labels Oct 24, 2023
holmanb added a commit to holmanb/cloud-init that referenced this issue Oct 24, 2023
@holmanb holmanb self-assigned this Oct 24, 2023
holmanb added a commit to holmanb/cloud-init that referenced this issue Oct 24, 2023
EphemeralIPv{4,6} failure is not always an error, therefore do not log
this event as an error in the context manager. Allow callsites to
determine log level.

Fixes GH-4540
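
The approach in the commit message above can be sketched as follows. This is an illustration under assumptions, not the code from the referenced commit: `NetworkSetupError`, `QuietEphemeralNetwork`, and `try_network` are hypothetical names. The context manager cleans up and raises without logging, and the callsite picks the log level based on whether it still has a fallback.

```python
import logging

LOG = logging.getLogger("ephemeral_sketch")


class NetworkSetupError(Exception):
    """Hypothetical error raised when ephemeral network setup fails."""


class QuietEphemeralNetwork:
    """Hypothetical context manager that never logs its own failure."""

    def __init__(self, fail):
        self.fail = fail

    def __enter__(self):
        if self.fail:
            # Clean up and propagate; the severity is left to the caller.
            self.__exit__(None, None, None)
            raise NetworkSetupError("ephemeral network setup failed")
        return self

    def __exit__(self, exc_type, exc, tb):
        return False


def try_network(fail, fallback_available):
    """Return the log level the callsite chose, or None on success."""
    try:
        with QuietEphemeralNetwork(fail):
            return None
    except NetworkSetupError as e:
        # A failure is only an error if nothing else can be tried.
        level = logging.DEBUG if fallback_available else logging.ERROR
        LOG.log(level, "%s", e)
        return level
```

With this split, an IPv4 failure on a host that can still use IPv6 is logged at debug level, while the same failure with no fallback is logged as an error.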
@holmanb
Member

holmanb commented Oct 30, 2023

This bug also affects OpenStack.
