Add ec2 IPv6 IMDS Support (SC-336) #1160
Conversation
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging mitechie, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag mitechie to reopen it.)
Force-pushed from 25dc4e1 to ddad201
Edit: It seems the distro-provided version of httpretty causes issues with some tests; switching to the (newer) pip version resolved this issue.
Thanks @holmanb . This has come together really nicely. I have just a few small comments.
Do we need to change the call on line 250 of DataSourceEc2.py to be asynchronous as well?
I think the `start_time` and `status_cb` assignments in `wait_for_url` might be better placed under all of the nested functions.
I went back and forth about how abstract this is. On one hand, the code is elegant and well structured. On the other hand, it can be hard to follow the flow through the layers of callbacks, partials, and closures. But then again, unwinding it all or adding types might cause it to be even harder to follow. I think at the end of the day, this is just a more functional approach, and it's really just a few functions, so I think it's fine.
Have you tested this on a live ec2 instance yet? I think it'd be worth ensuring we can fetch metadata in both ipv4-only and ipv6-only environments. If not, we should do that before we merge this. Can you post your test steps in the PR description so they can be easily reproduced in the future?
I think it'd be worth figuring out how to instrument pycloudlib to launch an ipv6 instance so we can write an integration test, but that doesn't need to come with this PR...though it'd be nice before our next SRU.
Thank you :)
I debated this myself, and honestly I still don't know the right answer. I need a better understanding of the semantics of AWS tokens. I should have noted this in the comments when I switched to requesting review, apologies. I do know that if we do make the PUT on line 250 async, the assertion at
Agreed
My sentiments exactly. I should note that since this code already had lots of callbacks it didn't seem too out of place when writing it. That said, I can always restructure to something less functional if needed. One concern I have with doing that is that there would likely be more code duplication and resulting potential for subtle differences in behavior between the sync/parallel implementations.
I have not yet. Since I haven't heard too strong of objections on this yet I'll get started on that next.
I like that idea.
This is great work. Please add a fairly detailed suggested commit message in the description of the PR as it merits a lot more context about the design/architecture here.
Also, a pointer to RFC 6555 in the commit message would be helpful.
cloudinit/url_helper.py (outdated)

```python
                return_exception,
                returned_address,
            )
        raise return_exception
```
Would we want to specifically look at `requests.ConnectionError` failures, which basically tell us there is no route to host for IPv6, and ignore them? No route means we really aren't ever going to succeed there, right?
Interesting question. I don't think that "no route to host" should ever happen (because we're using link-local addresses[1]), but maybe I'm wrong. Generally the idea of "ignore connection failures in case the others succeed" is something I can get behind.
[1] To my knowledge, Linux doesn't require fib entries for sending directly to a link-local address.
A quick experiment:
```console
[ec2-user@ip-172-31-6-134 ~]$ ip r
default via 172.31.0.1 dev eth0
169.254.169.254 dev eth0
172.31.0.0/20 dev eth0 proto kernel scope link src 172.31.6.134
[ec2-user@ip-172-31-6-134 ~]$ sudo ip r del 169.254.169.254
[ec2-user@ip-172-31-6-134 ~]$ ip route show table all | grep -e 169 -e ec2
[ec2-user@ip-172-31-6-134 ~]$ wget http://169.254.169.254
--2022-01-25 23:39:54-- http://169.254.169.254/
Connecting to 169.254.169.254:80... connected.
HTTP request sent, awaiting response... 200 OK
```
Even though the link-local route was removed from the routing table, it was still able to communicate using that address.
Ahh maybe I was misinterpreting the log I saw on an IPv4 only instance:
```
2022-01-25 22:34:46,195 - url_helper.py[WARNING]: Calling 'None' failed [0/120s]: request error [HTTPConnectionPool(host='fd00:ec2::254', port=80): Max retries exceeded with url: /latest/api/token (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7facb9c98400>: Failed to establish a new connection: [Errno 101] Network is unreachable'))]
```
Thanks for the context. "Network is unreachable" is often related to routing issues, I see what you were getting at. Regardless of what "Network is unreachable" is caused by, I think this suggestion is a good one. Specifically this would make connectivity possible in the case that there is a slow first address and an unreachable second address.
> Interesting question. I don't think that "no route to host" should ever happen (because we're using link-local addresses[1]), but maybe I'm wrong.
Update: I was wrong.
My former argument here was invalid. The link-local address used on EC2 instances is actually the destination address, not the source address. HTTP(S) traffic leaves via the default route on a private (RFC 1918) address.
Remove default routes and this error could happen:
```console
[ec2-user@ip-172-31-6-134 ~]$ wget http://[fd00:ec2::254]/latest/meta-data/
--2022-02-10 22:36:17-- http://[fd00:ec2::254]/latest/meta-data/
Connecting to [fd00:ec2::254]:80... failed: Network is unreachable.
```
I still think that the route I deleted in the example above (`169.254.169.254 dev eth0`) is unneeded, but for different reasons.
This comment doesn't imply a different course of action, but I wanted to set the record straight here.
Forgot to set "request changes" for moving the `connect_synchronously` call up into `_maybe_fetch_api_token`.
Force-pushed from 3679de8 to e70ad7e
pycloudlib does have the capacity to launch a dual-stack IPv4/IPv6 VPC (though this may not yet work against an IPv6-only IMDS) via something like:

```python
vpc = ec2.get_or_create_vpc(name="cloud-init-dualstack-vpc")
ec2.launch(image_id, vpc=vpc)
```

So I think we can subclass `tests.integration_tests.clouds.Ec2Cloud._perform_launch` to call `get_or_create_vpc` first in cases where we want dual-stack support for testing.
Force-pushed from 25d232e to 0d890d4
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging TheRealFalcon, and he will ensure that someone takes a look soon. (If the pull request is closed and you would like to continue working on it, please do tag TheRealFalcon to reopen it.)
Force-pushed from d3f2b45 to 4a99f79
The failing test is an annoying bug that was fixed a few years ago. Here's a workaround:

```diff
diff --git a/tests/unittests/sources/test_ec2.py b/tests/unittests/sources/test_ec2.py
index 3848924d..9e373632 100644
--- a/tests/unittests/sources/test_ec2.py
+++ b/tests/unittests/sources/test_ec2.py
@@ -530,6 +530,14 @@ class TestEc2(test_helpers.HttprettyTestCase):
             md={"md": old_metadata},
         )
         self.assertTrue(ds.get_data())
+
+        # Workaround https://github.com/getsentry/responses/issues/212
+        if hasattr(responses.mock, "_urls"):
+            for index, url in enumerate(responses.mock._urls):
+                if url["url"].startswith(
+                    "http://169.254.169.254/2009-04-04/meta-data/"
+                ):
+                    del responses.mock._urls[index]
         # Provide new revision of metadata that contains network data
         register_mock_metaserver(
             "http://169.254.169.254/2009-04-04/meta-data/", DEFAULT_METADATA
```

Since it's only one test, I'm fine just inserting it there. If this becomes a bigger issue we'd probably want to find a more elegant solution.
@TheRealFalcon nice! Also matches my findings about responses 0.17.0 being the first version without this issue. Maybe it's worth adding a line to the comment explaining that the workaround can be removed once there is no need for testing with responses < 0.17.0 (that is, post-Focal EOL)?
Yeah, good idea, though I think the workaround is only necessary in Bionic.
Force-pushed from cdafd46 to d1f9222
Thanks! This works locally, so I have included it. Also, I had to rebase on the tip of main, since the latest commit introduced a merge conflict in one of the tests I changed.
Previously the 403 test verified that the following output was logged:

```
"PUT /latest/api/token HTTP/1.1" 403 0
```

This log output is from `urllib3.connectionpool` and is redundant with the other checked logs, one of which is only logged when `code == 403`. We don't currently get this log when run under dual stack, so drop it from the test.
This allows a `ConnectionError` exception to be thrown in one of the threads without canceling the other thread: the first successful call is returned, and the rest are canceled. If all threads raise exceptions, log them all and raise the last one (we need to choose one somehow, right? I chose the last; tell me if there is a reason we should raise the first in order of execution, or the first in alphabetical order, or whatever). Tests to verify the behavior described above have been added.
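The first-success-wins semantics described in this commit message can be sketched roughly as follows. This is a hypothetical, simplified helper (the names `first_success` and its structure are not cloud-init's actual code) showing how one thread's connection failure is recorded without canceling its siblings, and how the last exception is re-raised only when every attempt fails:

```python
import concurrent.futures as futures


def first_success(callables):
    """Run zero-argument callables in parallel threads.

    An exception (e.g. ConnectionError) in one thread does not cancel
    the others; the first successful result wins. If every attempt
    fails, the collected exceptions could be logged and the last one
    observed is re-raised.
    """
    errors = []
    with futures.ThreadPoolExecutor(max_workers=len(callables)) as pool:
        pending = [pool.submit(c) for c in callables]
        for done in futures.as_completed(pending):
            try:
                result = done.result()
            except Exception as exc:
                errors.append(exc)  # keep waiting on sibling threads
                continue
            for fut in pending:
                # cancel() only stops attempts that haven't started yet;
                # threads already running simply finish and are ignored.
                fut.cancel()
            return result
    raise errors[-1]
```

Note that `Future.cancel()` cannot interrupt a thread that is already mid-request, which is one reason the "canceled" siblings are simply discarded rather than forcibly stopped.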
…nfigured addresses)
- type hints
- make sure the response gets sent to the optional time callback, because that's what it used to do
- don't assign lambda functions to kwargs, who does that
I believe this is ready for re-review.
I made an integration test for this. If you use canonical/pycloudlib#188 and apply this patch, it should work:

```diff
diff --git a/tests/integration_tests/clouds.py b/tests/integration_tests/clouds.py
index e5fe56d1..5ea41cbb 100644
--- a/tests/integration_tests/clouds.py
+++ b/tests/integration_tests/clouds.py
@@ -207,10 +207,14 @@ class Ec2Cloud(IntegrationCloud):
     def _perform_launch(self, launch_kwargs, **kwargs):
         """Use a dual-stack VPC for cloud-init integration testing."""
-        launch_kwargs["vpc"] = self.cloud_instance.get_or_create_vpc(
-            name="ec2-cloud-init-integration"
-        )
+        if "vpc" not in launch_kwargs:
+            launch_kwargs["vpc"] = self.cloud_instance.get_or_create_vpc(
+                name="ec2-cloud-init-integration"
+            )
+        if "Ipv6AddressCount" not in launch_kwargs:
+            launch_kwargs["Ipv6AddressCount"] = 1
         pycloudlib_instance = self.cloud_instance.launch(**launch_kwargs)
+        pycloudlib_instance.enable_ipv6_instance_metadata()
         return pycloudlib_instance

diff --git a/tests/integration_tests/datasources/test_ec2.py b/tests/integration_tests/datasources/test_ec2.py
new file mode 100644
index 00000000..8cde4dc9
--- /dev/null
+++ b/tests/integration_tests/datasources/test_ec2.py
@@ -0,0 +1,43 @@
+import re
+
+import pytest
+
+from tests.integration_tests.instances import IntegrationInstance
+
+
+def _test_crawl(client, ip):
+    assert client.execute("cloud-init clean --logs").ok
+    assert client.execute("cloud-init init --local").ok
+    log = client.read_from_file("/var/log/cloud-init.log")
+    assert f"Using metadata source: '{ip}'" in log
+    result = re.findall(
+        r"Crawl of metadata service took (\d+.\d+) seconds", log
+    )
+    if len(result) != 1:
+        pytest.fail(f"Expected 1 metadata crawl time, got {result}")
+    # 20 would still be a crazy long time for metadata service to crawl,
+    # but it's short enough to know we're not waiting for a response
+    assert float(result[0]) < 20
+
+
+@pytest.mark.ec2
+def test_dual_stack(client: IntegrationInstance):
+    # Drop IPv4 responses
+    assert client.execute("iptables -I INPUT -s 169.254.169.254 -j DROP").ok
+    _test_crawl(client, "http://[fd00:ec2::254]")
+
+    # Block IPv4 requests
+    assert client.execute("iptables -I OUTPUT -d 169.254.169.254 -j REJECT").ok
+    _test_crawl(client, "http://[fd00:ec2::254]")
+
+    # Re-enable IPv4
+    assert client.execute("iptables -D OUTPUT -d 169.254.169.254 -j REJECT").ok
+    assert client.execute("iptables -D INPUT -s 169.254.169.254 -j DROP").ok
+
+    # Drop IPv6 responses
+    assert client.execute("ip6tables -I INPUT -s fd00:ec2::254 -j DROP").ok
+    _test_crawl(client, "http://169.254.169.254")
+
+    # Block IPv6 requests
+    assert client.execute("ip6tables -I OUTPUT -d fd00:ec2::254 -j REJECT").ok
+    _test_crawl(client, "http://169.254.169.254")
```

It'll default all of our ec2 tests to being dual-stack with IPv6 metadata enabled, but I don't see any reason that would be a problem.
A few minor comments inline, but otherwise looks good. Once we get the dual-stack timeout fixed and the integration test passes, we should be good to go.
Additionally, some tickets we should create for follow-up:
- Add `pytest-responses` to `Build-Depends` of all the packaging branches. I also plan on adding `pytest-mock`, so we could do those at the same time.
- Replace httpretty with responses everywhere. No need to have 2 libraries doing the same thing.
- Ensure our EphemeralDHCPv4 code gets updated accordingly for an IPv6-only environment
This looks good to me now. Thanks for all the work you put into this!
- Add OpenStack IPv6 metadata URL `fe80::a9fe:a9fe`
- Enable requesting multiple metadata sources in parallel

This PR is very similar to #1160, reusing the provided `url_helper` logic.

LP: #1906849
Additional Context
In contrast to "happy eyeballs", this implementation supports parallel attempts to communicate, with a small delay favoring IPv4. This is datasource-specific, and other datasources that wish to favor IPv6 can do so trivially.
Technical Description of Changes:
- a new function (`url_helper.py:dual_stack()`) that does a 0.150s delay, cancels pending responses, etc. (basically a stateless, generic happy-eyeballs implementation for synchronous libraries, but at the HTTP layer, not the socket/connection layer)
- split `wait_for_url()` into reusable pieces to support optionally parallel HTTP requests to endpoints
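The staggered, parallel attempt described above can be sketched as follows. This is a hypothetical simplification (the names `staggered_fetch` and `_delayed` are illustrative, not the actual `dual_stack()` implementation, which also handles timeouts and response cancellation):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

STAGGER_DELAY = 0.150  # seconds of head start for the first address


def _delayed(fetch, address, wait):
    # Later-listed addresses start slightly later, giving earlier
    # addresses a head start rather than a monopoly.
    time.sleep(wait)
    return fetch(address)


def staggered_fetch(fetch, addresses, delay=STAGGER_DELAY):
    """Attempt all addresses in parallel, staggered by `delay` seconds.

    Ordering `addresses` as [ipv4, ipv6] favors IPv4; a datasource
    preferring IPv6 just reverses the list. The first success wins;
    if every attempt fails, the last exception observed is raised.
    """
    last_error = None
    with ThreadPoolExecutor(max_workers=len(addresses)) as pool:
        pending = [
            pool.submit(_delayed, fetch, addr, i * delay)
            for i, addr in enumerate(addresses)
        ]
        for done in as_completed(pending):
            try:
                return done.result()
            except Exception as exc:
                last_error = exc
    raise last_error
```

If the favored address answers within the stagger window, its result is returned before the second attempt has done any real work, which is what keeps the common IPv4-only case fast.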