DataSourceNoCloudNet not configurable via config files #5288

Closed
mgollo opened this issue May 13, 2024 · 26 comments
Labels
fixed in main The reported issue has already been fixed in the main branch.

Comments

@mgollo

mgollo commented May 13, 2024

Bug report

When DataSourceNoCloudNet had not yet been forked off DataSourceNoCloud, the following config would allow loading the config file via HTTP without kernel command line parameters or SMBIOS serial:

datasource_list: [NoCloud]

datasource:
  NoCloud:
    seedfrom: http://someserver/userdata/

This configuration was tested and working with RHEL 8.8 VMware VM templates with an injected network configuration and cloud-init 22.1. In the old version of DataSourceNoCloud, the configuration would be read, the URL would be called and the configuration applied.
In RHEL 9.4 with cloud-init 23.4 (also if the latest DataSourceNoCloud.py from main branch was manually applied), this configuration leads to cloud-init detecting DataSourceNoCloud (as before), which no longer supports http* seedfrom URLs. It would have to detect DataSourceNoCloudNet, but the ds_detect function of the DataSourceNoCloudNet class only checks SMBIOS serials and the kernel command line.
Therefore DataSourceNoCloudNet cannot be configured via /etc/cloud.
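
To make the gap concrete, here is a minimal sketch of the kind of check described above. It is not cloud-init's actual code: the function name `nocloud_net_detected` and the simplified matching on a `ds=nocloud-net` token are assumptions made only for illustration.

```python
def nocloud_net_detected(kernel_cmdline: str, smbios_serial: str) -> bool:
    """Hypothetical, simplified stand-in for the detection described above.

    Only the kernel command line and the SMBIOS serial are consulted;
    nothing under /etc/cloud is ever looked at.
    """
    # Kernel command line: look for a ds=nocloud-net token, optionally
    # followed by ;key=value settings.
    for token in kernel_cmdline.split():
        if token.startswith("ds="):
            name = token[len("ds="):].split(";", 1)[0]
            if name.lower() == "nocloud-net":
                return True
    # SMBIOS serial: the same ds=... convention is assumed here.
    return smbios_serial.lower().startswith("ds=nocloud-net")


# A config-file-only setup provides neither hint, so detection fails.
print(nocloud_net_detected("root=/dev/vda1 console=ttyS0", ""))
print(nocloud_net_detected("root=/dev/vda1 ds=nocloud-net;s=http://someserver/userdata/", ""))
```

With a check of this shape, a seedfrom URL that is set only in /etc/cloud/cloud.cfg.d/ can never make detection succeed.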

Steps to reproduce the problem

Use the above config snippet in /etc/cloud/cloud.cfg.d/10_foreman.cfg and no SMBIOS serial or kernel command line.

Environment details

  • Cloud-init version: 23.4 and later
  • Operating System Distribution: Red Hat Enterprise Linux 9.4
  • Cloud provider, platform or installer type: VMware vSphere with VM templates

cloud-init logs

DataSourceNoCloud.py[DEBUG]: Seed from http://someserver/userdata/ not supported by DataSourceNoCloud [seed=None][dsmode=net]
@mgollo mgollo added bug Something isn't working correctly new An issue that still needs triage labels May 13, 2024
@ani-sinha
Contributor

I believe the following change introduced the change in behavior

commit 612b4de892d19333c33276d541fed99fd16d3998
Author: Brett Holman <brett.holman@canonical.com>
Date:   Fri Mar 31 15:24:09 2023 -0600

    Standardize kernel commandline user interface (#2093)
    
    - deprecate ci.ds= and ci.datasource= in favor of ds=
    - enable semi-colon-delimited datasource everywhere
    - add support for case-insensitive datasource match
    - add integration tests
    

@dermotbradley
Contributor

I just tested the same seed configuration (which I have used in the past) using cloud-init "git master" and it works as expected.

I do see the same log message as mentioned in the issue report but this appears during mode "init-local":

2024-05-13 15:40:30,651 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.alpine.Distro'>
2024-05-13 15:40:30,652 - sources[DEBUG]: Looking for data source in: ['NoCloud'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM']
2024-05-13 15:40:30,679 - sources[DEBUG]: Searching for local data source in: ['DataSourceNoCloud']
2024-05-13 15:40:30,680 - handlers.py[DEBUG]: start: init-local/search-NoCloud: searching for local data from DataSourceNoCloud
...
...
2024-05-13 15:40:30,726 - DataSourceNoCloud.py[DEBUG]: Seed from http://myserver.mynetwork/seed/ not supported by DataSourceNoCloud [seed=None][dsmode=net]
2024-05-13 15:40:30,726 - sources[DEBUG]: Datasource DataSourceNoCloud [seed=None][dsmode=net] not updated for events: boot-new-instance
2024-05-13 15:40:30,727 - handlers.py[DEBUG]: finish: init-local/search-NoCloud: SUCCESS: no local data found from DataSourceNoCloud
2024-05-13 15:40:30,727 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2024-05-13 15:40:30,728 - main.py[DEBUG]: No local datasource found
2024-05-13 15:40:30,729 - util.py[DEBUG]: Reading from /sys/class/net/lo/address (quiet=False)

However, during mode "init-network", things work as expected:

2024-05-13 15:40:30,840 - networking.py[DEBUG]: net: all expected physical devices present
2024-05-13 15:40:30,840 - stages.py[DEBUG]: applying net config names for {'ethernets': {'eth0': {'dhcp4': True, 'dhcp6': True, 'set-name': 'eth0', 'match': {'macaddress': '08:00:27:98:f0:10'}}}, 'version': 2}
...
...
2024-05-13 15:40:30,852 - subp.py[DEBUG]: Running command ['ip', '-6', 'addr', 'show', 'permanent', 'scope', 'global'] with allowed return codes [0] (shell=False, capture=True)
2024-05-13 15:40:30,916 - subp.py[DEBUG]: Running command ['ip', '-4', 'addr', 'show'] with allowed return codes [0] (shell=False, capture=True)
2024-05-13 15:40:30,922 - net[DEBUG]: Detected interfaces {'lo': {'downable': True, 'device_id': None, 'driver': None, 'mac': '00:00:00:00:00:00', 'name': 'lo', 'up': False}, 'eth0': {'downable': True, 'device_id': '0x0001', 'driver': 'virtio_net', 'mac': '08:00:27:98:f0:10', 'name': 'eth0', 'up': False}}
2024-05-13 15:40:30,923 - net[DEBUG]: no work necessary for renaming of [['08:00:27:98:f0:10', 'eth0', 'virtio_net', '0x0001']]
2024-05-13 15:40:30,923 - stages.py[INFO]: Applying network configuration from fallback bringup=False: {'ethernets': {'eth0': {'dhcp4': True, 'dhcp6': True, 'set-name': 'eth0', 'match': {'macaddress': '08:00:27:98:f0:10'}}}, 'version': 2}
2024-05-13 15:40:30,924 - util.py[DEBUG]: Writing to /run/cloud-init/sem/apply_network_config.once - wb: [644] 25 bytes
2024-05-13 15:40:30,933 - distros[DEBUG]: Selected renderer 'eni' from priority list: ['eni']
...
...
2024-05-13 15:40:30,943 - network_state.py[DEBUG]: v2(ethernets) -> v1(physical): {'type': 'physical', 'mac_address': '08:00:27:98:f0:10', 'name': 'eth0', 'match': {'macaddress': '08:00:27:98:f0:10'}, 'subnets': [{'type': 'dhcp4'}, {'type': 'dhcp6'}]}
2024-05-13 15:40:30,954 - network_state.py[DEBUG]: v2_common: handling config: {'eth0': {'dhcp4': True, 'dhcp6': True, 'set-name': 'eth0', 'match': {'macaddress': '08:00:27:98:f0:10'}}}
2024-05-13 15:40:30,956 - util.py[DEBUG]: Writing to /etc/network/interfaces - wb: [644] 418 bytes
2024-05-13 15:40:30,962 - util.py[DEBUG]: Writing to /etc/udev/rules.d/70-persistent-net.rules - wb: [644] 96 bytes
2024-05-13 15:40:30,964 - distros[DEBUG]: Not bringing up newly configured network interfaces
2024-05-13 15:40:30,964 - main.py[DEBUG]: [local] Exiting without datasource
...
...
2024-05-13 15:40:30,968 - util.py[DEBUG]: cloud-init mode 'init' took 0.633 seconds (0.64)
2024-05-13 15:40:30,969 - handlers.py[DEBUG]: finish: init-local: SUCCESS: searching for local datasources
2024-05-13 15:40:40,294 - util.py[DEBUG]: Cloud-init v. 24.2 running 'init' at Mon, 13 May 2024 15:40:40 +0000. Up 26.28 seconds.
2024-05-13 15:40:40,294 - main.py[INFO]: PID [1823] started cloud-init.
2024-05-13 15:40:40,295 - main.py[DEBUG]: No kernel command line url found.
2024-05-13 15:40:40,295 - main.py[DEBUG]: Closing stdin.
2024-05-13 15:40:40,301 - util.py[DEBUG]: Writing to /var/log/cloud-init.log - ab: [640] 0 bytes
2024-05-13 15:40:40,303 - util.py[DEBUG]: Changing the ownership of /var/log/cloud-init.log to 0:4
2024-05-13 15:40:40,304 - subp.py[DEBUG]: Running command ['ip', '--json', 'addr'] with allowed return codes [0] (shell=False, capture=True)
2024-05-13 15:40:40,313 - subp.py[DEBUG]: Running command ['ip', '-o', 'route', 'list'] with allowed return codes [0] (shell=False, capture=True)
2024-05-13 15:40:40,319 - subp.py[DEBUG]: Running command ['ip', '--oneline', '-6', 'route', 'list', 'table', 'all'] with allowed return codes [0, 1] (shell=False, capture=True)
2024-05-13 15:40:40,342 - handlers.py[DEBUG]: start: init-network/check-cache: attempting to read from cache [trust]
2024-05-13 15:40:40,343 - util.py[DEBUG]: Reading from /var/lib/cloud/instance/obj.pkl (quiet=False)
2024-05-13 15:40:40,343 - stages.py[DEBUG]: no cache found
2024-05-13 15:40:40,343 - handlers.py[DEBUG]: finish: init-network/check-cache: SUCCESS: no cache found
2024-05-13 15:40:40,363 - stages.py[DEBUG]: Using distro class <class 'cloudinit.distros.alpine.Distro'>
2024-05-13 15:40:40,364 - sources[DEBUG]: Looking for data source in: ['NoCloud'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM', 'NETWORK']
2024-05-13 15:40:40,369 - sources[DEBUG]: Searching for network data source in: ['DataSourceNoCloudNet']
2024-05-13 15:40:40,370 - handlers.py[DEBUG]: start: init-network/search-NoCloudNet: searching for network data from DataSourceNoCloudNet
2024-05-13 15:40:40,370 - sources[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'>
2024-05-13 15:40:40,371 - sources[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2024-05-13 15:40:40,373 - sources[DEBUG]: Machine is configured to run on single datasource DataSourceNoCloudNet [seed=None][dsmode=net].
2024-05-13 15:40:40,373 - dmi.py[DEBUG]: querying dmi data /sys/class/dmi/id/product_serial
...
...
2024-05-13 15:40:40,423 - subp.py[DEBUG]: Running command ['blkid', '-tLABEL_FATBOOT=cidata', '-odevice'] with allowed return codes [0, 2] (shell=False, capture=True)
2024-05-13 15:40:40,430 - url_helper.py[DEBUG]: [0/11] open 'http://myserver.mynetwork/seed/meta-data' with {'url': 'http://myserver.mynetwork/seed/meta-data', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/24.2'}} configuration
2024-05-13 15:40:40,471 - url_helper.py[DEBUG]: Read from http://myserver.mynetwork/seed/meta-data (200, 233b) after 1 attempts
2024-05-13 15:40:40,471 - util.py[DEBUG]: Attempting to load yaml from string of length 233 with allowed root types (<class 'dict'>,)
2024-05-13 15:40:40,474 - url_helper.py[DEBUG]: [0/11] open 'http://myserver.mynetwork/seed/user-data' with {'url': 'http://myserver.mynetwork/seed/user-data', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/24.2'}} configuration
2024-05-13 15:40:40,490 - url_helper.py[DEBUG]: Read from http://myserver.mynetwork/seed/user-data (200, 1612b) after 1 attempts
2024-05-13 15:40:40,491 - url_helper.py[DEBUG]: [0/11] open 'http://myserver.mynetwork/seed/vendor-data' with {'url': 'http://myserver.mynetwork/seed/vendor-data', 'stream': False, 'allow_redirects': True, 'method': 'GET', 'headers': {'User-Agent': 'Cloud-Init/24.2'}} configuration
2024-05-13 15:40:40,509 - url_helper.py[DEBUG]: Please wait 1 seconds while we wait to try again
...
...
2024-05-13 15:40:50,720 - util.py[DEBUG]: Error in vendor-data response: 404 Client Error: Not Found for url: http://myserver.mynetwork/seed/vendor-data
2024-05-13 15:40:50,722 - DataSourceNoCloud.py[DEBUG]: Using seeded cache data from http://myserver.mynetwork/seed/
2024-05-13 15:40:50,728 - util.py[DEBUG]: Reading from /etc/hosts (quiet=False)
2024-05-13 15:40:50,729 - util.py[DEBUG]: Read 79 bytes from /etc/hosts
2024-05-13 15:40:50,747 - util.py[DEBUG]: Writing to /run/cloud-init/cloud-id-nocloud - wb: [644] 8 bytes
2024-05-13 15:40:50,751 - util.py[DEBUG]: Creating symbolic link from '/run/cloud-init/cloud-id' => '/run/cloud-init/cloud-id-nocloud'
2024-05-13 15:40:50,758 - atomic_helper.py[DEBUG]: Atomically writing to file /run/cloud-init/instance-data-sensitive.json (via temporary file /run/cloud-init/tmpg3fha7qg) - w: [600] 8920 bytes/chars
2024-05-13 15:40:50,763 - atomic_helper.py[DEBUG]: Atomically writing to file /run/cloud-init/instance-data.json (via temporary file /run/cloud-init/tmpzwap7df2) - w: [644] 1889 bytes/chars
2024-05-13 15:40:50,765 - handlers.py[DEBUG]: finish: init-network/search-NoCloudNet: SUCCESS: found network data from DataSourceNoCloudNet
2024-05-13 15:40:50,765 - util.py[DEBUG]: Attempting to remove /var/lib/cloud/instance
2024-05-13 15:40:50,767 - stages.py[INFO]: Loaded datasource DataSourceNoCloudNet - DataSourceNoCloudNet [seed=ds_config_seedfrom,http://myserver.mynetwork/seed/][dsmode=net]
...
...
2024-05-13 15:40:51,048 - handlers.py[DEBUG]: start: init-network/setup-datasource: setting up datasource
2024-05-13 15:40:51,049 - handlers.py[DEBUG]: finish: init-network/setup-datasource: SUCCESS: setting up datasource

@holmanb
Member

holmanb commented May 13, 2024

Thanks for filing this issue. @mgollo can you please test to see if this issue persists on 24.1? We already fixed some bugs in this space that I believe may have fixed this issue.

with cloud-init 23.4 (also if the latest DataSourceNoCloud.py from main branch was manually applied), this configuration leads to cloud-init detecting DataSourceNoCloud (as before), which no longer supports http* seedfrom URLs. It would have to detect DataSourceNoCloudNet, but the ds_detect function of the DataSourceNoCloudNet class only checks SMBIOS serials and the kernel command line.
Therefore DataSourceNoCloudNet cannot be configured via /etc/cloud.

Great analysis @mgollo. I think that this explains what you are seeing correctly.

DataSourceNoCloud.py[DEBUG]: Seed from http://someserver/userdata/ not supported by DataSourceNoCloud [seed=None][dsmode=net]

This is a red herring. The log is telling the truth, but this isn't where the code change that caused this happens. You will see this on 24.1 today as well, where I believe this issue is already fixed.

Relevant discussions and doc references :

https://github.com/canonical/cloud-init/pull/5147/files#diff-46389feb50580360c7f8da93ae0491b59f8ab08919cb98b2336ee28e51d7fc70R40-R52

@sayan3296 The docs that you link to describe relevant changes, but I don't think that they describe the one that caused this.

#5165 (comment)

That comment isn't relevant to this issue. It describes a situation in which users may have had an invalid configuration that previously worked silently but no longer will.

I believe the following change introduced the change in behavior

commit 612b4de

You aren't far off, but I don't think that this is where the change was actually introduced. The reason is that when 612b4de landed, ds_detect() would only get called if ds-identify didn't report a "single datasource", which might take the form of [NoCloud, None] or [NoCloud]. Since ds-identify automatically added None to the list in all cases where a datasource was found, the configuration consumed by cloud-init's Python code would have been [NoCloud, None]. However, 76b7b38 was proposed to fix this behavior, since skipping detection broke some edge cases where users might explicitly pass [Azure, None] and want cloud-init to detect which platform it was running on. I think that commit is the actual source of the change in behavior. In cdbbd17 we made ds-identify stop automatically adding None to the datasource_list, which means that we should once again not be running ds_detect() under the configuration that you've provided. That is all to say: "I think that it's fixed in main".
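
As a rough illustration of the "single datasource" rule as it stood when 612b4de landed, the following sketch models only the decision of whether ds_detect() gets called. The helper names are hypothetical and the logic is a simplification of what is described above, not a copy of cloud-init's code. It also assumes that None in datasource_list reaches Python as the string "None", which is how YAML parses that bare word.

```python
def reported_single_datasource(datasource_list, dsname):
    """Hypothetical helper: True for [dsname] or [dsname, "None"]."""
    return list(datasource_list) in ([dsname], [dsname, "None"])


def would_call_ds_detect(datasource_list, dsname):
    """Detection is skipped only when a single datasource is configured."""
    return not reported_single_datasource(datasource_list, dsname)


for candidate in (["NoCloud"], ["NoCloud", "None"], [], ["NoCloud", "ConfigDrive"]):
    called = would_call_ds_detect(candidate, "NoCloud")
    print(candidate, "-> ds_detect() called:", called)
```

Under this model, the configuration from the report ([NoCloud]) skips detection, while an empty or multi-entry list does not.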

Aside

When DataSourceNoCloudNet had not yet been forked off DataSourceNoCloud

I wouldn't describe this as "forked off". All that really happened is that the user interface was simplified so that cloud-init could discern from the seedfrom value which code path is the correct one to use. The two codepaths were already separate, so really "merged in" might be a better description of that change.
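
A minimal sketch of what "discern from the seedfrom value" could look like follows. The function name `seed_code_path` is hypothetical. The http(s) prefixes match the `supported_seed_starts` value shown for DataSourceNoCloudNet in the diff quoted in this thread; the local prefixes are an assumption.

```python
NETWORK_SEED_STARTS = ("http://", "https://")  # as in DataSourceNoCloudNet
LOCAL_SEED_STARTS = ("/", "file://")  # assumed prefixes for local seeds


def seed_code_path(seedfrom: str):
    """Hypothetical helper: name the class expected to handle a seedfrom value."""
    if seedfrom.startswith(NETWORK_SEED_STARTS):
        return "DataSourceNoCloudNet"
    if seedfrom.startswith(LOCAL_SEED_STARTS):
        return "DataSourceNoCloud"
    return None  # unsupported scheme


print(seed_code_path("http://someserver/userdata/"))
print(seed_code_path("/var/lib/cloud/seed/nocloud/"))
```

This is also consistent with the "not supported by DataSourceNoCloud" debug message: the local code path sees an http URL, declines it, and leaves it for the network stage.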

@holmanb
Member

holmanb commented May 13, 2024

@dermotbradley I didn't see your message until after I posted. Thanks for independently confirming that this is indeed already fixed in upstream cloud-init.

@holmanb holmanb added fixed in main The reported issue has already been fixed in the main branch. and removed bug Something isn't working correctly new An issue that still needs triage labels May 13, 2024
@ani-sinha
Contributor

You aren't far off, but I don't think that this is where the change was actually introduced. The reason is that when 612b4de landed, ds_detect() would only get called if ds-identify didn't report a "single datasource", which might take the form of [NoCloud, None] or [NoCloud]. Since ds-identify automatically added None to the list in all cases where a datasource was found, the configuration consumed by cloud-init's Python code would have been [NoCloud, None]. However, 76b7b38 was proposed to fix this behavior, since skipping detection broke some edge cases where users might explicitly pass [Azure, None] and want cloud-init to detect which platform it was running on. I think that commit is the actual source of the change in behavior. In cdbbd17 we made ds-identify stop automatically adding None to the datasource_list, which means that we should once again not be running ds_detect() under the configuration that you've provided. That is all to say: "I think that it's fixed in main".

wow! What a convoluted mess. We are checking if cdbbd17ae400e4 fixed it for us.

@mgollo
Author

mgollo commented May 14, 2024

I have just manually installed the latest main branch on a test VM and I can confirm that the issue is resolved there.

@sayan3296

Hello @mgollo, thanks for confirming. Other than applying the patch (using the latest main branch), did you need to change anything in the cloud-init config of the template itself?

I mean, do you have a /etc/cloud/cloud.cfg.d/10_datasource.cfg similar to the one specified in the Foreman docs, or does it need tweaking?

@sayan3296

@ani-sinha I confirm b489e4f is the specific change that fixes the issue. Please see if this can be backported.

@mgollo
Author

mgollo commented May 14, 2024

@sayan3296
No, I used the same config that worked with cloud-init 22.1 on RHEL 8 before. The config is set up according to the Foreman/Red Hat Satellite docs. The network configuration is done via vmware-tools in our case, and cloud-init is only used to download the config generated from a Foreman template.

@holmanb
Member

holmanb commented May 14, 2024

wow! What a convoluted mess. We are checking if cdbbd17ae400e4 fixed it for us.

@ani-sinha Agreed, and very sorry for the churn :-/ These fixes stemmed from the requirement to support OpenStack on bare metal.

@mgollo @sayan3296 Thanks for confirming. This commit should be included in the 24.1.x upstream release. I'll put up a PR. Never mind, this change is already in 24.1.x: 0ae4728.

@holmanb holmanb closed this as completed May 14, 2024
@ani-sinha
Contributor

ani-sinha commented May 21, 2024

wow! What a convoluted mess. We are checking if cdbbd17ae400e4 fixed it for us.

@ani-sinha Agreed, and very sorry for the churn :-/ These fixes stemmed from the requirement to support OpenStack on bare metal.

@holmanb
OK, there seems to be one more regression. When datasource_list is empty, override_ds_detect() does not hit this condition:

elif self.sys_cfg.get("datasource_list", []) == [self.dsname]:
    LOG.debug(
        "Machine is configured to run on single datasource %s.", self
    )
    return True

Therefore, ds_detect() is called for DataSourceNoCloudNet, which returns False if neither the kernel command line nor the DMI serial number contains nocloud-net. This change came from

diff --git a/cloudinit/sources/DataSourceNoCloud.py b/cloudinit/sources/DataSourceNoCloud.py
index a32bd4d08..596a96a78 100644
--- a/cloudinit/sources/DataSourceNoCloud.py
+++ b/cloudinit/sources/DataSourceNoCloud.py
@@ -357,6 +357,14 @@ class DataSourceNoCloudNet(DataSourceNoCloud):
         DataSourceNoCloud.__init__(self, sys_cfg, distro, paths)
         self.supported_seed_starts = ("http://", "https://")
 
+    def ds_detect(self):
+        """NoCloud requires "nocloud-net" as the way to specify
+        seeding from an http(s) address. This diverges from all other
+        datasources in that it does a kernel commandline match on something
+        other than the datasource dsname for only DEP_NETWORK.
+        """
+        return "nocloud-net" == sources.parse_cmdline()
+

which is part of the commit

commit 612b4de892d19333c33276d541fed99fd16d3998
Author: Brett Holman <brett.holman@canonical.com>
Date:   Fri Mar 31 15:24:09 2023 -0600

    Standardize kernel commandline user interface (#2093)
    
    - deprecate ci.ds= and ci.datasource= in favor of ds=
    - enable semi-colon-delimited datasource everywhere
    - add support for case-insensitive datasource match
    - add integration tests

which was subsequently overwritten by

commit 66b5ce9d5f94c0c6625972fdfdca3796d365b069
Author: Brett Holman <brett.holman@canonical.com>
Date:   Fri Feb 23 11:16:15 2024 -0700

    fix(nocloud): smbios datasource definition
    
    deprecate nocloud-net name

So now we definitely need a non-empty datasource_list if we want to use DataSourceNoCloud. Previously this was not the case. This is a (significant) deviation from past behavior and our customers are confused.

Previous case:

2024-05-20 09:51:17,149 - __init__.py[DEBUG]: Searching for network data source in: ['DataSourceNoCloudNet', 'DataSourceAltCloud', 'DataSourceOVFNet', 'DataSourceMAAS', 'DataSourceGCE', 'DataSourceOpenStack', 'DataSourceAliYun', 'DataSourceEc2', 'DataSourceCloudStack', 'DataSourceBigstep', 'DataSourceExoscale', 'DataSourceUpCloud', 'DataSourceVMware', 'DataSourceNone']
2024-05-20 09:51:17,149 - handlers.py[DEBUG]: start: init-network/search-NoCloudNet: searching for network data from DataSourceNoCloudNet
2024-05-20 09:51:17,149 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'>
2024-05-20 09:51:17,150 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2024-05-20 09:51:17,150 - __init__.py[DEBUG]: Machine is running on DataSourceNoCloudNet [seed=None][dsmode=net].

Now:

2024-05-20 10:42:14,125 - __init__.py[DEBUG]: Searching for network data source in: ['DataSourceNoCloudNet', 'DataSourceAltCloud', 'DataSourceOVFNet', 'DataSourceMAAS', 'DataSourceGCE', 'DataSourceOpenStack', 'DataSourceAliYun', 'DataSourceEc2', 'DataSourceCloudStack', 'DataSourceBigstep', 'DataSourceExoscale', 'DataSourceUpCloud', 'DataSourceVMware', 'DataSourceAkamai', 'DataSourceNone']
2024-05-20 10:42:14,125 - handlers.py[DEBUG]: start: init-network/search-NoCloudNet: searching for network data from DataSourceNoCloudNet
2024-05-20 10:42:14,125 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'>
2024-05-20 10:42:14,125 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2024-05-20 10:42:14,126 - __init__.py[DEBUG]: Datasource type DataSourceNoCloudNet [seed=None][dsmode=net] is not detected.
2024-05-20 10:42:14,126 - __init__.py[DEBUG]: Datasource DataSourceNoCloudNet [seed=None][dsmode=net] not updated for events: boot-new-instance
2024-05-20 10:42:14,127 - handlers.py[DEBUG]: finish: init-network/search-NoCloudNet: SUCCESS: no network data found from DataSourceNoCloudNet
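
For comparing the two cases across many hosts, a small script can pull the relevant verdicts out of cloud-init.log. The sketch below is hypothetical tooling, not part of cloud-init; its regular expressions are written against the two message formats shown in the logs above and would need adjusting for other versions.

```python
import re

# Message formats taken from the log excerpts above.
DETECTED = re.compile(r"Machine is running on (\S+) ")
NOT_DETECTED = re.compile(r"Datasource type (\S+) .* is not detected\.")


def classify(log_text: str):
    """Return (detected, not_detected) datasource class names."""
    detected, not_detected = [], []
    for line in log_text.splitlines():
        match = DETECTED.search(line)
        if match:
            detected.append(match.group(1))
            continue
        match = NOT_DETECTED.search(line)
        if match:
            not_detected.append(match.group(1))
    return detected, not_detected


old_log = "2024-05-20 09:51:17,150 - __init__.py[DEBUG]: Machine is running on DataSourceNoCloudNet [seed=None][dsmode=net]."
new_log = "2024-05-20 10:42:14,126 - __init__.py[DEBUG]: Datasource type DataSourceNoCloudNet [seed=None][dsmode=net] is not detected."
print(classify(old_log))
print(classify(new_log))
```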

@TheRealFalcon
Member

@ani-sinha, can you help me understand the use case here a bit more? Do you know how your customers are passing their NoCloud data to the instance? You are correct that ds_detect() only checks the kernel command line and DMI, but ds-identify runs at generator timeframe and also checks for the seed dir and filesystem label. If it finds either, it will write datasource_list: [ NoCloud, None ] to /run/cloud-init/cloud.cfg, which will take precedence over anything under /etc/cloud. As far as I can tell, that covers all of NoCloud's supported use cases.

Based on your logs, seeing many datasources listed from the Python code, it appears that ds-identify isn't running, may be customized, or may be misconfigured. Do you know why that might be? We should be able to improve what you're asking for in the python code, but I suspect there may be larger problems if ds-identify is not running correctly.

@zhaohuijuan

zhaohuijuan commented May 21, 2024

@TheRealFalcon, below are the details of the regression.
There is no such issue in cloud-init-23.1.1 and earlier versions, but we hit it after upgrading to cloud-init-23.4.

Bug description:
cloud-init not finding DataSourceNoCloud after update from cloud-init-23.1.1 to 23.4

Using the loopback device to insert a cloudinit.iso and trigger cloud-init as below, no network data is found from DataSourceNoCloudNet.

modprobe loop
losetup /dev/loop0 /media/cloudinit.iso
ln -s /dev/loop0 /dev/sr0
cloud-init init && cloud-init modules -m config && cloud-init modules -m final

Note that it works if the ISO is inserted as a CD into the VM and the VM is rebooted.
It does not work via the loopback device. We do it this way because cloudinit.iso is just a file on the overall installation ISO.

Test steps:

  1. Customize rhel-guest-image to config the root password
  2. Deploy VM with the rhel-guest image in step 1
  3. Make datasource iso cloudinit.iso
    $ genisoimage -o cloudinit.iso -V cidata -r -J meta-data user-data

$ cat meta-data
instance-id: kvm-cloudinit
local-hostname: kvm-cloudinit-test

$ cat user-data
#cloud-config
users:
  - name: cloud-user
    lock_passwd: false
    groups: adm, systemd-journal
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh_authorized_keys:
      - ssh-rsa xxxxxx
chpasswd:
  users:
    - name: root
      password: xxxx
      type: text
    - name: cloud-user
      password: xxxx
      type: text
  expire: False
ssh_pwauth: True
disable_root: False

  4. Log in to the VM and scp cloudinit.iso to it
  5. Use the loopback device to insert the cloudinit.iso and trigger cloud-init with:
    modprobe loop
    losetup /dev/loop0 /media/cloudinit.iso
    ln -s /dev/loop0 /dev/sr0
    cloud-init init && cloud-init modules -m config && cloud-init modules -m final

Actual results:
Failed to detect DataSourceNoCloudNet.
cloud-init.log
---
2024-05-20 10:42:13,967 - __init__.py[DEBUG]: Looking for data source in: ['NoCloud', 'ConfigDrive', 'LXD', 'OpenNebula', 'DigitalOcean', 'Azure', 'AltCloud', 'OVF', 'MAAS', 'GCE', 'OpenStack', 'AliYun', 'Vultr', 'Ec2', 'CloudSigma', 'CloudStack', 'SmartOS', 'Bigstep', 'Scaleway', 'Hetzner', 'IBMCloud', 'Oracle', 'Exoscale', 'RbxCloud', 'UpCloud', 'VMware', 'NWCS', 'Akamai', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM', 'NETWORK']
2024-05-20 10:42:14,125 - __init__.py[DEBUG]: Searching for network data source in: ['DataSourceNoCloudNet', 'DataSourceAltCloud', 'DataSourceOVFNet', 'DataSourceMAAS', 'DataSourceGCE', 'DataSourceOpenStack', 'DataSourceAliYun', 'DataSourceEc2', 'DataSourceCloudStack', 'DataSourceBigstep', 'DataSourceExoscale', 'DataSourceUpCloud', 'DataSourceVMware', 'DataSourceAkamai', 'DataSourceNone']
2024-05-20 10:42:14,125 - handlers.py[DEBUG]: start: init-network/search-NoCloudNet: searching for network data from DataSourceNoCloudNet
2024-05-20 10:42:14,125 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'>
2024-05-20 10:42:14,125 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2024-05-20 10:42:14,126 - __init__.py[DEBUG]: Datasource type DataSourceNoCloudNet [seed=None][dsmode=net] is not detected.
2024-05-20 10:42:14,126 - __init__.py[DEBUG]: Datasource DataSourceNoCloudNet [seed=None][dsmode=net] not updated for events: boot-new-instance
2024-05-20 10:42:14,127 - handlers.py[DEBUG]: finish: init-network/search-NoCloudNet: SUCCESS: no network data found from DataSourceNoCloudNet
---

Additional info:
In cloud-init-23.1.1-11, no such issue with the same test steps
$ cloud-init status --long
...
last_update: Mon, 20 May 2024 09:51:20 +0000
detail:
DataSourceNoCloudNet [seed=/dev/loop0][dsmode=net]

cloud-init.log:
---
2024-05-20 09:51:16,987 - __init__.py[DEBUG]: Looking for data source in: ['NoCloud', 'ConfigDrive', 'LXD', 'OpenNebula', 'DigitalOcean', 'Azure', 'AltCloud', 'OVF', 'MAAS', 'GCE', 'OpenStack', 'AliYun', 'Vultr', 'Ec2', 'CloudSigma', 'CloudStack', 'SmartOS', 'Bigstep', 'Scaleway', 'Hetzner', 'IBMCloud', 'Oracle', 'Exoscale', 'RbxCloud', 'UpCloud', 'VMware', 'NWCS', 'None'], via packages ['', 'cloudinit.sources'] that matches dependencies ['FILESYSTEM', 'NETWORK']
2024-05-20 09:51:17,149 - __init__.py[DEBUG]: Searching for network data source in: ['DataSourceNoCloudNet', 'DataSourceAltCloud', 'DataSourceOVFNet', 'DataSourceMAAS', 'DataSourceGCE', 'DataSourceOpenStack', 'DataSourceAliYun', 'DataSourceEc2', 'DataSourceCloudStack', 'DataSourceBigstep', 'DataSourceExoscale', 'DataSourceUpCloud', 'DataSourceVMware', 'DataSourceNone']
2024-05-20 09:51:17,149 - handlers.py[DEBUG]: start: init-network/search-NoCloudNet: searching for network data from DataSourceNoCloudNet
2024-05-20 09:51:17,149 - __init__.py[DEBUG]: Seeing if we can get any data from <class 'cloudinit.sources.DataSourceNoCloud.DataSourceNoCloudNet'>
2024-05-20 09:51:17,150 - __init__.py[DEBUG]: Update datasource metadata and network config due to events: boot-new-instance
2024-05-20 09:51:17,150 - __init__.py[DEBUG]: Machine is running on DataSourceNoCloudNet [seed=None][dsmode=net].
---

@ani-sinha
Contributor

ani-sinha commented May 22, 2024

Using the loopback device to insert a cloudinit.iso and trigger cloud-init as below, no network data is found from DataSourceNoCloudNet.

modprobe loop
losetup /dev/loop0 /media/cloudinit.iso
ln -s /dev/loop0 /dev/sr0
cloud-init init && cloud-init modules -m config && cloud-init modules -m final

Another data point is, as per Method 1 in https://cloudinit.readthedocs.io/en/latest/reference/datasources/nocloud.html#configuration-methods we have

# lsblk -o name,mountpoint,label,FSTYPE,size,uuid /dev/sr0
NAME  MOUNTPOINT LABEL  FSTYPE   SIZE UUID
loop0            cidata iso9660  366K 2023-12-07-03-29-18-00

The problem here is that _get_data() in DataSourceNoCloud is not even getting called, because this condition is False in override_ds_detect():

elif self.ds_detect():
    LOG.debug(
        "Detected platform: %s. Checking for active instance data",
        self,
    )
    return self._get_data()

@ani-sinha
Contributor

Using the loopback device to insert a cloudinit.iso and trigger cloud-init as below, no network data is found from DataSourceNoCloudNet.

modprobe loop
losetup /dev/loop0 /media/cloudinit.iso
ln -s /dev/loop0 /dev/sr0
cloud-init init && cloud-init modules -m config && cloud-init modules -m final

Another data point is, as per Method 1 in https://cloudinit.readthedocs.io/en/latest/reference/datasources/nocloud.html#configuration-methods we have

# lsblk -o name,mountpoint,label,FSTYPE,size,uuid /dev/sr0
NAME  MOUNTPOINT LABEL  FSTYPE   SIZE UUID
loop0            cidata iso9660  366K 2023-12-07-03-29-18-00

The problem here is that _get_data() in DataSourceNoCloud is not even getting called, because this condition is False in override_ds_detect():

elif self.ds_detect():
    LOG.debug(
        "Detected platform: %s. Checking for active instance data",
        self,
    )
    return self._get_data()

Some more observations:
In both the working and non-working cases, /run/cloud-init/ds-identify.log shows:

DMI_PRODUCT_NAME=KVM
DMI_SYS_VENDOR=Red Hat
DMI_PRODUCT_SERIAL=
DMI_PRODUCT_UUID=930112bd-4d83-47bd-b4c4-3bbf184f84ab
PID_1_PRODUCT_NAME=unavailable
DMI_CHASSIS_ASSET_TAG=
DMI_BOARD_NAME=RHEL-AV
FS_LABELS=root,boot
ISO9660_DEVS=
KERNEL_CMDLINE=BOOT_IMAGE=(hd0,gpt3)/vmlinuz-5.14.0-427.13.1.el9_4.x86_64 root=UUID=e8ee85f8-75b0-452c-be0a-8fda3bb86581 console=tty0 console=ttyS0,115200n8 no_timer_check net.ifnames=0 crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M
VIRT=kvm
UNAME_KERNEL_NAME=Linux
UNAME_KERNEL_RELEASE=5.14.0-427.13.1.el9_4.x86_64
UNAME_KERNEL_VERSION=#1 SMP PREEMPT_DYNAMIC Wed Apr 10 10:29:16 EDT 2024
UNAME_MACHINE=x86_64
UNAME_NODENAME=localhost
UNAME_OPERATING_SYSTEM=GNU/Linux
DSNAME=
DSLIST=MAAS ConfigDrive NoCloud AltCloud Azure Bigstep CloudSigma CloudStack DigitalOcean Vultr AliYun Ec2 GCE OpenNebula OpenStack OVF SmartOS Scaleway Hetzner IBMCloud Oracle Exoscale RbxCloud UpCloud VMware LXD NWCS Akamai
MODE=search
ON_FOUND=all
ON_MAYBE=all
ON_NOTFOUND=disabled
pid=526 ppid=502
is_container=false
is_ds_enabled(IBMCloud) = true.
ec2 platform is 'Unknown'.
is_ds_enabled(IBMCloud) = true.
No ds found [mode=search, notfound=disabled]. Disabled cloud-init [1]
[up 3.64s] returning 1

This means FS_LABELS are not identified correctly. Also in both cases, I see

# cat /var/run/cloud-init/cloud.cfg 
di_report:
  datasource_list: [  ]
  # reporting not found result. notfound=disabled.
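
As a rough way to repeat the FS_LABELS check on other hosts, the KEY=VALUE lines of ds-identify.log can be scanned with a small script. This is only a sketch: the helper names are hypothetical, and it assumes FS_LABELS is a comma-separated list, as it appears in the log above.

```python
def parse_key_values(text):
    """Collect KEY=VALUE lines into a dict, splitting on the first '=' only."""
    pairs = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        # Skip free-form lines such as "ec2 platform is 'Unknown'."
        if sep and key.isidentifier():
            pairs[key] = value
    return pairs


def seed_label_seen(log_text, label="cidata"):
    """Return True if `label` is listed in FS_LABELS (assumed comma-separated)."""
    labels = parse_key_values(log_text).get("FS_LABELS", "")
    return label in [item for item in labels.split(",") if item]


# Shortened sample based on the log above.
sample = """DMI_PRODUCT_NAME=KVM
FS_LABELS=root,boot
ISO9660_DEVS=
KERNEL_CMDLINE=BOOT_IMAGE=(hd0,gpt3)/vmlinuz root=UUID=e8ee85f8 console=tty0
is_ds_enabled(IBMCloud) = true.
"""

print(seed_label_seen(sample))
```

For the log shown above this reports that the cidata label was not seen, which matches the empty datasource_list.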

Also, in both cases, blkid does show it correctly:


# blkid -c /dev/null -o export
DEVNAME=/dev/loop0
UUID=2023-12-07-03-29-18-00
LABEL=cidata
TYPE=iso9660

DEVNAME=/dev/vda4
LABEL=root
UUID=e8ee85f8-75b0-452c-be0a-8fda3bb86581
TYPE=xfs
PARTUUID=6264d520-3fb9-423f-8ab8-7a0a8e3d3562

DEVNAME=/dev/vda2
SEC_TYPE=msdos
UUID=7B77-95E7
TYPE=vfat
PARTUUID=68b2905b-df3e-4fb3-80fa-49d1e773aa33

DEVNAME=/dev/vda3
LABEL=boot
UUID=1e759718-9420-4840-b151-5212e9cc1580
TYPE=xfs
PARTUUID=cb07c243-bc44-4717-853e-28852021225b

DEVNAME=/dev/vda1
PARTUUID=fac7f1fb-3e8d-4137-a512-961de09a5549
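
To compare this against what ds-identify reports, the blkid export output can be checked for the seed label mechanically. The following is a minimal sketch with hypothetical helper names; it assumes device records are separated by blank lines, as in the output above, and matches the label exactly.

```python
def parse_blkid_export(text):
    """Split `blkid -o export` output into one dict per device record."""
    devices = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current device record.
            if current:
                devices.append(current)
                current = {}
            continue
        key, _, value = line.partition("=")
        current[key] = value
    if current:
        devices.append(current)
    return devices


def devices_with_label(text, label="cidata"):
    """Return the DEVNAME of every record whose LABEL equals `label`."""
    return [
        dev["DEVNAME"]
        for dev in parse_blkid_export(text)
        if dev.get("LABEL") == label and "DEVNAME" in dev
    ]


# Shortened sample based on the output above.
sample = """DEVNAME=/dev/loop0
UUID=2023-12-07-03-29-18-00
LABEL=cidata
TYPE=iso9660

DEVNAME=/dev/vda4
LABEL=root
TYPE=xfs
"""

print(devices_with_label(sample))  # ['/dev/loop0']
```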

So my suspicion is that ds-identify is not detecting the mounted ISO in either case, but in the previous version _get_data() from DataSourceNoCloud was still being called, and it did its own parsing of devices and labels. See

        label = self.ds_cfg.get("fs_label", "cidata")
        if label is not None:
            for dev in self._get_devices(label):
                try:
                    LOG.debug("Attempting to use data from %s", dev)

                    try:
                        seeded = util.mount_cb(
                            dev, _pp2d_callback, pp2d_kwargs
                        )
                    except ValueError:
                        LOG.warning(
                            "device %s with label=%s not a valid seed.",
                            dev,
                            label,
                        )
                        continue

                    mydata = _merge_new_seed(mydata, seeded)

                    LOG.debug("Using data from %s", dev)
                    found.append(dev)
                    break
                except OSError as e:
                    if e.errno != errno.ENOENT:
                        raise
                except util.MountFailedError:
                    util.logexc(
                        LOG, "Failed to mount %s when looking for data", dev
                    )

It was able to find the mounted iso. I do see this log in the working case in /var/log/cloud-init.log

2024-05-22 07:45:53,220 - DataSourceNoCloud.py[DEBUG]: Using data from /dev/loop0 

This log is missing in the non-working case completely.

So the fundamental question is why ds-identify is not detecting the labelled filesystem when blkid shows it (at least when I ran it). When I ran read_fs_info_linux manually after sourcing ds-identify, it was able to find it. Maybe there is some kind of race somewhere; I am not sure.

@ani-sinha
Contributor

I can't seem to find where ds-identify is called from. It seems to be called only from cloud-init-generator, which is executed very early during boot [1]. If that is the case, /dev/sr0 is not yet set up at that time. Maybe that is why it can't find the device so early in the boot process.

  1. https://www.freedesktop.org/software/systemd/man/latest/systemd.generator.html

@ani-sinha
Contributor

ani-sinha commented May 22, 2024

Another thing our customers have reported is that doing cloud-init init --local before cloud-init init && cloud-init modules -m config && cloud-init modules -m final gets around the issue. I do not know why that is. Anyway I am quite confused at this point. Any help will be appreciated.

@ani-sinha
Contributor

Ah I see what is going on

    if name == "init":
        if args.local:
            rname, rdesc = ("init-local", "searching for local datasources")
        else:
            rname, rdesc = (
                "init-network",
                "searching for network datasources",
            )

This means that, to detect local datasources, one has to explicitly run cloud-init init --local; otherwise the stage is init-network, which no longer works for the reasons I analyzed before.
I suspect that running init --local will also work in the previous case. Let me check with the customer.

@TheRealFalcon
Member

TheRealFalcon commented May 22, 2024

Another thing our customers have reported is that doing cloud-init init --local before cloud-init init && cloud-init modules -m config && cloud-init modules -m final gets around the issue. I do not know why that is. Anyway I am quite confused at this point. Any help will be appreciated.

See https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceNoCloud.py#L415-L418. In the local timeframe, cloud-init uses the NoCloud datasource. In the network timeframe, it uses NoCloudNet. NoCloud doesn't override ds_detect(), and so falls through to the base class, which simply returns True.

If adding cloud-init init --local does not work, you may also want to run /usr/lib/cloud-init/ds-identify --force (that path may differ depending on OS) before all of it.
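
The fall-through described above can be illustrated with a toy model. This is a sketch only: the class names are hypothetical stand-ins rather than cloud-init's actual classes, and the "ds=nocloud" substring check is an assumed simplification of the real detection logic.

```python
class DataSourceSketch:
    """Hypothetical stand-in for the base datasource class."""

    def ds_detect(self):
        # Base behaviour as described above: always report a match.
        return True


class NoCloudSketch(DataSourceSketch):
    """Local-stage datasource: inherits ds_detect() unchanged."""


class NoCloudNetSketch(DataSourceSketch):
    """Network-stage datasource: matches only on explicit markers."""

    def __init__(self, kernel_cmdline="", smbios_serial=""):
        self.kernel_cmdline = kernel_cmdline
        self.smbios_serial = smbios_serial

    def ds_detect(self):
        # Assumed simplification: look for a "ds=nocloud" marker in
        # either the kernel command line or the SMBIOS serial.
        markers = (self.kernel_cmdline, self.smbios_serial)
        return any("ds=nocloud" in marker for marker in markers)


print(NoCloudSketch().ds_detect())     # True
print(NoCloudNetSketch().ds_detect())  # False
```

With neither marker present, only the local-stage class passes detection, so in this model the seed lookup is reached only when the local stage runs.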

@TheRealFalcon
Member

I do think that it's important to remember that cloud-init's primary use case is provisioning cloud instances on boot. We have 5 separate stages that run at boot, and later stages can and do rely on state that has been generated from earlier stages. While we generally try to maintain existing functionality without many breaking changes across version boundaries, that same expectation is not extended to running the services or the cloud-init binary ad hoc, especially when the services are not all run or are run out of order.

@ani-sinha
Contributor

See https://github.com/canonical/cloud-init/blob/main/cloudinit/sources/DataSourceNoCloud.py#L415-L418. In the local timeframe, cloud-init uses the NoCloud datasource. In the network timeframe, it uses NoCloudNet. NoCloud doesn't override ds_detect(), and so falls through to the base class, which simply returns True.

Yes I figured that out yesterday. I think DataSourceNoCloudNet was incorrectly covering up for a wrong use case.

@ani-sinha
Contributor

ani-sinha commented May 24, 2024

If adding cloud-init init --local does not work, you may also want to run /usr/lib/cloud-init/ds-identify --force (that path may differ depending on OS) before all of it.

One final question. https://cloudinit.readthedocs.io/en/latest/reference/datasources/nocloud.html does not explicitly say that DataSourceNoCloudNet will not look at labelled filesystems. While ds_detect() in DataSourceNoCloudNet does look at DMI and the kernel command line, it does not look at local filesystems, unlike DataSourceNoCloud, which does so in the loop
for dev in self._get_devices(label) . Is this intentional? Notice that
(DataSourceNoCloudNet, (sources.DEP_FILESYSTEM, sources.DEP_NETWORK))
That is, this datasource depends on both FILESYSTEM (local) and NETWORK.
@TheRealFalcon

@TheRealFalcon
Member

Is this intentional?

If by "not look into local filesystems" you mean not looking for an iso9660 drive, then yes, it is intentional. sources.DEP_FILESYSTEM is not NoCloud-specific. It's a way for a datasource to declare what it needs access to in order to initialize. During the init-local timeframe, DataSourceNoCloud needs no network, so it attempts to initialize the datasource in a way that doesn't require one. During the init-network timeframe, DataSourceNoCloudNet does need network, and so attempts to initialize the datasource in a way that does require a network connection.

@ani-sinha
Contributor

Is this intentional?

If by "not look into local filesystems" you mean not looking for an iso9660 drive, then yes, it is intentional. sources.DEP_FILESYSTEM is not NoCloud-specific. It's a way for a datasource to declare what it needs access to in order to initialize. During the init-local timeframe, DataSourceNoCloud needs no network, so it attempts to initialize the datasource in a way that doesn't require one. During the init-network timeframe, DataSourceNoCloudNet does need network, and so attempts to initialize the datasource in a way that does require a network connection.

But why, in the network case, does it not also look into the filesystem for iso9660? Why just the kernel command line and smbios?

@TheRealFalcon
Member

Because those are the only valid sources of network seed data.

During cloud-init-local.service, we have no network yet, so we use the DataSourceNoCloud class to look for the labeled iso9660 filesystem for nocloud data (among other places). Network is not available yet, so we don't look for http/ftp seeds yet.

During cloud-init.service, we now have a network connection, so we use the DataSourceNoCloudNet class, and because we already looked on the filesystem for nocloud data in the previous stage, we skip the local filesystem checks. We now only check sources that can contain a network-based seed. Those two sources are the kernel cmdline and smbios. If either of those two sources contains something along the lines of ds=nocloud;s=https://10.42.42.42/cloud-init/configs/, then we know to use that network address as a seed.
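
To make the ds=nocloud;s=... form concrete, here is a minimal sketch of pulling the seed URL out of such a string. The function name is hypothetical and this is not cloud-init's own parser; it assumes the value is a whitespace-delimited token of the form ds=NAME;key=value;..., as in the example above.

```python
def parse_nocloud_token(source):
    """Return (dsname, options) for the first 'ds=' token, or None."""
    for token in source.split():
        if not token.startswith("ds="):
            continue
        # "nocloud;s=https://..." -> name "nocloud", fields ["s=https://..."]
        name, *fields = token[len("ds="):].split(";")
        options = {}
        for field in fields:
            # Split on the first '=' only, so URLs keep any later '='.
            key, sep, value = field.partition("=")
            if sep:
                options[key] = value
        return name, options
    return None


cmdline = "root=/dev/vda1 ds=nocloud;s=https://10.42.42.42/cloud-init/configs/"
print(parse_nocloud_token(cmdline))
# ('nocloud', {'s': 'https://10.42.42.42/cloud-init/configs/'})
```

The same function can be applied to an SMBIOS serial string, since both sources are described above as carrying the same kind of value.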
