-
Notifications
You must be signed in to change notification settings - Fork 77
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
daemon: auto-attach when pro license added on gcp #1930
Conversation
3a4beec
to
1b39311
Compare
ee23c93
to
c691cc7
Compare
f7f070f
to
55a3a25
Compare
@basak @paride @renanrodrigo @lucasmoura @blackboxsw @CalvoM Large PR warning 😬 I think this daemon PR is ready for review. Please let me know what you think. It only starts on GCP and the long polling is even turned off for now until GCP supports adding a license while an instance is running. The old 5min license-check timer is removed in this PR since the daemon replaces it's functionality. |
ua_config: | ||
poll_for_pro_license: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this extra config here for the daemon to run ? Wouldn't having the cloud-init gce file be enough for it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to turn on/off the long polling after the initial license check.
So when this is false, we only check the gce metadata once and don't enter the polling loop. This is sufficient for upgrade-to-pro for now because GCE requires a reboot to attach the pro license.
As soon as GCE supports adding the pro license without a reboot, then we can flip the default value of this config field to true, so the daemon will start polling from then on (This will require an SRU, but since everything else is already in place, then it should be a relatively straightforward review).
# If you are uninterested in the (free for personal use) Ubuntu Advantage | ||
# services, including security updates after standard EOL and kernel patching | ||
# without rebooting, then you can safely stop and disable this service: | ||
# sudo systemctl stop ubuntu-advantage.service | ||
# sudo systemctl disable ubuntu-advantage.service |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add any considerations that this service only runs on GCP instances ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes I'll add that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated! Let me know if you'd prefer different wording
if not cfg.poll_for_pro_license: | ||
LOG.debug("Configured to not poll for pro license, shutting down") | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this check happen in the beginning of the function as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my answer above. This config option is only to prevent the daemon from entering the polling loop
55a3a25
to
f6879e3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The daemon integration test is failing for me on Xenial:
When I run `systemctl restart ubuntu-advantage.service` with sudo # features/steps/steps.py:248
Assertion Failed:
Expected: <0>
but: was <5>
(...)
2022-02-10 18:00:43,950:INFO:pycloudlib.instance:executing: sudo touch /run/cloud-init/cloud-id-gce
2022-02-10 18:00:44,731:INFO:pycloudlib.instance:executing: sudo truncate -s 0 /var/log/ubuntu-advantage-daemon.log
2022-02-10 18:00:45,365:INFO:pycloudlib.instance:executing: sudo systemctl restart ubuntu-advantage.service
2022-02-10 18:00:46,059:ERROR:root:Error executing command: systemctl restart ubuntu-advantage.service
2022-02-10 18:00:46,060:ERROR:root:stdout:
2022-02-10 18:00:46,060:ERROR:root:stderr: Failed to restart ubuntu-advantage.service: Unit ubuntu-advantage.service not found.
upstream cloud-init landed the PR to surface /run/cloud-init/cloud-id(-*) files. They are already build in add-apt-repository ppa:cloud-init-dev/daily for Jammy (and uploaded to Jammy proper so should be in cloud image shortly). The ppa:cloud-init-dev/daily recipes should build and publish tomorrow for other series so it should be much easier to test. |
@blackboxsw Splendid! I'll adjust the tests to use the daily PPA then |
f6879e3
to
db3da66
Compare
@renanrodrigo I'm having trouble reproducing this failure, let's sync on Monday to figure it out |
@orndorffgrant and I saw it passing earlier while going through this, I re-tested it using a clean environment and it passed consistently |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are passing for me. I think we should only add tests for the new daemon module and the PR should be good to go
@@ -0,0 +1,117 @@ | |||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are missing unit tests for this module
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pushed up unit tests for this module. But there is some interference such that when run as part of the entire test suite, some fail. I will look into it tomorrow
8f63590
to
b892890
Compare
uaclient/tests/test_daemon.py
Outdated
@mock.patch(M_PATH + "util.subp") | ||
class TestStart: | ||
def test_start_success(self, m_subp): | ||
start() | ||
assert [ | ||
mock.call(["systemctl", "start", "ubuntu-advantage.service"]) | ||
] == m_subp.call_args_list | ||
|
||
@mock.patch(M_PATH + "LOG.warning") | ||
def test_start_warning(self, m_log_warning, m_subp): | ||
err = util.ProcessExecutionError("cmd") | ||
m_subp.side_effect = err | ||
start() | ||
assert [ | ||
mock.call(["systemctl", "start", "ubuntu-advantage.service"]) | ||
] == m_subp.call_args_list | ||
assert [mock.call(err)] == m_log_warning.call_args_list | ||
|
||
|
||
@mock.patch(M_PATH + "util.subp") | ||
class TestStop: | ||
def test_stop_success(self, m_subp): | ||
stop() | ||
assert [ | ||
mock.call(["systemctl", "stop", "ubuntu-advantage.service"]) | ||
] == m_subp.call_args_list | ||
|
||
@mock.patch(M_PATH + "LOG.warning") | ||
def test_stop_warning(self, m_log_warning, m_subp): | ||
err = util.ProcessExecutionError("cmd") | ||
m_subp.side_effect = err | ||
stop() | ||
assert [ | ||
mock.call(["systemctl", "stop", "ubuntu-advantage.service"]) | ||
] == m_subp.call_args_list | ||
assert [mock.call(err)] == m_log_warning.call_args_list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can just use pytest.mark.parametrize
here instead, since the test are really similar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair suggestion, but I think I'd rather avoid adding an extra layer of abstraction over what is being tested. It would make coming back to the code in 6 months and asking "what is this test actually testing" just a little harder.
Also, while the tests for both functions are essentially the same now, that might not be true if they change in the future (or even during further iteration of this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good SRU manual test scripts in general for typical upgrade cases.
If we don't already have one. In a separate PR, It would be nice to grow an integration testthat runs an upgrade from the released version of ua-tools in -updates to daily ppa:ua-client/daily to assert there are no failures in upgrade path and expected systemd services.Some additional automated test coverage of upgrade path can ensure we don't degrade packaging upgrade path behavior and related systemd services.
I think we generally cover automated tests for the following upgrade test cases:
- running the suite of integration tests against an image produced by the version of ua-tools that lives in ppa:ua-client/daily
- LTS to LTS upgrade test in features/ubuntu_upgrade.feature
But I don't think we have a behave test which validates the package upgrade from the version in xenial-updates to the version in ppa:ua-client/daily
I believe that we need to change the After= declarations in the ubuntu-advantage.service, but beyond that the rest of my comments you can take as you see fit. I'll still need tomorrow for reviewing the daemon.py behavior and notify usage before I can fully approve here.
if [ -f $OLD_LICENSE_CHECK_MARKER_FILE ]; then | ||
rm -f $OLD_LICENSE_CHECK_MARKER_FILE | ||
fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not that it really matters, but we can avoid the -f test as rm -f exits zero in absence of file
if [ -f $OLD_LICENSE_CHECK_MARKER_FILE ]; then | |
rm -f $OLD_LICENSE_CHECK_MARKER_FILE | |
fi | |
rm -f $OLD_LICENSE_CHECK_MARKER_FILE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Simpler and better +1
When I run `systemctl restart ubuntu-advantage.service` with sudo | ||
# TODO-END: remove this chunk once cloud-init supports this feature | ||
When I wait `1` seconds | ||
When I run `cat /var/log/ubuntu-advantage-daemon.log` with sudo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth also checking explicit systemctl behavior for conditionpathexists is caught?
When I run `cat /var/log/ubuntu-advantage-daemon.log` with sudo | |
When I run `systemctl status ubuntu-advantage.service` as non-root | |
Then stdout matches regexp: | |
""" | |
ConditionPathExists=\/run\/cloud-init\/cloud-id-gce was not met | |
""" | |
When I run `cat /var/log/ubuntu-advantage-daemon.log` with sudo |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great idea
features/daemon.feature
Outdated
""" | ||
daemon ending | ||
""" | ||
# verify attach tops it immediately and doesn't restart after reboot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# verify attach tops it immediately and doesn't restart after reboot | |
# verify attach stops it immediately and doesn't restart after reboot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops :) good catch
""" | ||
daemon ending | ||
""" | ||
# verify it stays on when configured to do so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to assert that systemctl agrees with the expected state of the service?
My expectation from systemctl is-failed will be it'll report "inactive" if we successfully exited and remain inactive. It'll report "failed" if we exited non-zero.
# verify it stays on when configured to do so | |
When I run `systemctl is-enabled ubuntu-advantage.service` as non-root | |
Then stdout matches regex: | |
""" | |
enabled | |
""" | |
When I run `systemctl is-failed ubuntu-advantage.service` as non-root | |
Then stdout matches regex: | |
""" | |
inactive | |
""" | |
# verify it stays on when configured to do so |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh interesting, yeah I'll look into adding this
systemd/ubuntu-advantage.service
Outdated
[Unit] | ||
Description=Ubuntu Advantage GCP Auto Attach Daemon | ||
Documentation=man:ubuntu-advantage https://ubuntu.com/advantage | ||
After=network.target network-online.target systemd-networkd.service ua-auto-attach.service cloud-config.target |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After declaration looks good for the ua-auto-attach.service which is oneshot so it'll guarantee we don't start unit after ua-auto-attach.service is complete.
I don't think we want After=cloud-config.target
because that means we can start running the second cloud-init init
stage completes, but before or parallel with cloud-config.service running. The cloud-config.service boot stage runs all modules in /etc/cloud/cloud.cfg cloud_config_modules
section which includes ubuntu-advantage
module which reacts to any ua
#cloud-config directives. I ultimately think we want this daemon only to run after cloud-config.service is complete, so that any user-data with ubuntu-advantage:
directives is already processed by cloud-init so the daemon may not be needed in this case if cloud-config user-data manually attached the instance at launch.
Long winded... but the suggestion is to order After=cloud-config.service
which will be certain any UA-related user-data is also done being processed
After=network.target network-online.target systemd-networkd.service ua-auto-attach.service cloud-config.target | |
After=network.target network-online.target systemd-networkd.service ua-auto-attach.service cloud-config.service |
Example timestamps emitted from separate systemd services declared After=cloud-config.target vs After=cloud-config.service vs. when ubuntu-advantage cloud-init module is run
root@dev-x:~# cat /didit
# edit [correctedtypo cloud-init.target below to cloud-config.target. But, you got the picture
cloud-config.target: 2022-03-02T22:49:16,460108103+00:00
cloud-config.service: 2022-03-02T22:49:20,023110396+00:00
root@dev-x:~# fgrep "Running module ubuntu-advantage" /var/log/cloud-init.log
2022-03-02 22:49:19,962 - stages.py[DEBUG]: Running module ubuntu-advantage (<module 'cloudinit.config.cc_ubuntu_advantage' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ubuntu_advantage.py'>) with frequency once-per-instance
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent catch!!!
Yes I agree and I thought that's what this was doing, but after re-reviewing cloud-config.target and cloud-config.service, you are right. We should be After: cloud-config.service
to guarantee we run after all relevant cloud-init modules.
After=network.target network-online.target systemd-networkd.service ua-auto-attach.service cloud-config.target | ||
|
||
# Only run if not already attached | ||
ConditionPathExists=!/var/lib/ubuntu-advantage/private/machine-token.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth explicit integration test coverage as mentioned above to assert both of the specific not-enabled cases to make sure our ConditionPathExists declarations are behaving as "AND" instead of "OR"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have the section of the integration test labeled with comment
# verify attach stops it immediately and doesn't restart after reboot
The "and doesn't restart after reboot" I think covers this scenario?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and I think explicitly validating the "Condition failed reason would be helpful in that test, I think you commented in the affirmative on my comment over in that space/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, thanks for clarifying - and I should've responded again here, this was added on line 88 :)
Then I verify that running `systemctl status ubuntu-advantage.service` `with sudo` exits `3` | ||
Then stdout matches regexp: | ||
""" | ||
Active: inactive \(dead\) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we also assert that it fails because of the failed Conditional machine-token.json here do you think so we can establish the reason for the dead status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes!
uaclient/daemon.py
Outdated
LOG.debug("Cancelling polling") | ||
return | ||
except exceptions.DelayProLicensePolling: | ||
time.sleep(600) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make this sleep configurable to allow GCP folks to backoff if the 10 min interval is too often?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should only happen in case of errors.
I imagine if there is a persistent error then the user would just want to turn off the polling completely with the feature flag added elsewhere in this PR.
It wouldn't be too hard to make this configurable too though - I think its probably worth it.
Related: do you think 10 minutes is a good default here? Too short?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SRU pre-review: This seems to implement the spec as-approved, and doesn't appear to introduce unrelated changes. This would be an acceptable SRU.
@@ -355,10 +350,24 @@ remove_old_systemd_timers() { | |||
# Since we actually want to remove this service from now on | |||
# we have replicated that behavior here | |||
if [ -L $OLD_MESSAGING_TIMER_MASKED_LOCATION ]; then | |||
if [ -x "/usr/bin/deb-systemd-helper" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SRU nit: Dropping the test for deb-systemd-helper
appears sensible (init-system-helpers
is Essential: yes, so people messing with it are asking for trouble), but isn't a necessary part of these changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point! Yes this is an unrelated change and really shouldn't be lumped in with this feature. I'll give this change it's own commit and it's own PR to keep things separate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved this change over here: #1993 where we'll separately debate the merits
# Only run if not already attached | ||
ConditionPathExists=!/var/lib/ubuntu-advantage/private/machine-token.json | ||
# Only run on GCP | ||
ConditionPathExists=/run/cloud-init/cloud-id-gce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SRU review: Am I correct in understanding that as per the spec document this is pending a new cloud-init feature?
Not blocking, just checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you are correct. The feature was merged into upstream cloud-init here: canonical/cloud-init#1244
@blackboxsw I believe the latest couple commits address all your concerns. I'll leave them as separate commits for now so you can more easily remove them After you review, I'll rebase and squash things together - that is also when I'll take out the unrelated postinst change that RAOF mentioned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent @orndorffgrant. Thank you for the updates. Confirmed upgrade and downgrade behavior is clean with setup and teardown of ua-license-check vs ubuntu-advantage services.
Though your integration tests cover it well, confirmed that the daemon doesn't start polling automatically due to your config switch set false.
Something to note is the resident memory that we sit on even when idle:
RES SHR
26364 12500 S 0.0 0.7 0:00.29 python3
We may want to look at ways to decrease this mem footprint in subsequent efforts. I don't know if UA's lru_cache is adding much to this footprint, probably not remotely as much as all the module imports we use throughout uaclient. But, it's something we should be wary of as we grow functionality/features in this daemon.
After=network.target network-online.target systemd-networkd.service ua-auto-attach.service cloud-config.target | ||
|
||
# Only run if not already attached | ||
ConditionPathExists=!/var/lib/ubuntu-advantage/private/machine-token.json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes and I think explicitly validating the "Condition failed reason would be helpful in that test, I think you commented in the affirmative on my comment over in that space/
return self.cfg.get("ua_config", {}).get("poll_for_pro_license", False) | ||
|
||
@poll_for_pro_license.setter | ||
def poll_for_pro_license(self, value: bool): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is anything using this setter on UAConfig? I'm guessing no. Is the reason you are defining this now so that you can lay groundwork for eventually integrating this into action_config_set
/action_config_unset
in the ua config
subcommand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly, See the TODO in the comment on line 232.
I also just created a Jira Issue for this: https://warthogs.atlassian.net/browse/SC-860
# TODO: when polling is supported | ||
# 1. add this field to UA_CONFIGURABLE_KEYS | ||
return self.cfg.get("ua_config", {}).get( | ||
"polling_error_retry_delay", 600 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for providing the additional config option here. Though it may very well be above-and-beyond, making it configurable will give folks an option to backoff if UA is hitting frequent errors related to accessing GCP IMDS.
|
||
daemon.poll_for_pro_license(cfg) | ||
|
||
LOG.debug("daemon ending") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not critical, but I wonder if it's worth handling signal.SIGTERM to also report daemon is being shutdown, otherwise we leave open logs in the "daemon starting" without tracking the external termination of that process. Can be separate PR if you think it's worthwhile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea! I've created a separate Jira issue for it here: https://warthogs.atlassian.net/browse/SC-861
fcf8984
to
c24ee58
Compare
Thanks @blackboxsw Ack to memory consumption. There is definitely room for optimization there. Based on some quick testing, we could pretty easily lower that by almost 2Mb by writing a daemon specific logging setup function. Jira issue here: https://warthogs.atlassian.net/browse/SC-862 Then we could potentially lower it by a couple more Mb with a more extensive refactor to UAConfig. We really only need read access to the config for the daemon, and UAConfig currently does much more than that. Jira Issue here: https://warthogs.atlassian.net/browse/SC-863 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a minor nit, let me know if it makes sense or not
except Exception as e: | ||
LOG.exception(e) | ||
cfg.add_notice("", status.NOTICE_DAEMON_AUTO_ATTACH_FAILED) | ||
lock.clear_lock_file_if_present() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the lock cleared by default after we exit the context manager ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be. the only case where it wouldn't is if delete_cache_key
failed for some reason. Then this would try again. I was just following the pattern we have elsewhere where we call this function when we catch an exception. It is a noop
if the lock was already cleared so it seemed harmless.
I can remove it if you think it is unnecessary though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, I think we can keep it just as a safe mechanism then
Just reporting that I re-ran all the new integration tests with the recent changes and they all passed I will also move the manual test scripts out of the 27.7 folder into a new 27.8 folder and re-run them |
c24ee58
to
b162548
Compare
Done and all manual tests passed |
Notes:
The license-check commit is exactly the same as the last one we merged
The daemon commit is significantly different, but the details of fetching the license are the same.
The biggest thing here (besides making sure the daemon works for auto-upgrade-to-pro) is making sure the daemon doesn't even run on non-gcpgeneric platforms.
TODO:
Proposed Commit Message
Test Steps
Run new integration tests. Maybe not every combination of series/platform, but run a sample of series and platforms to make sure they pass.
Run the manual test scripts
gcp_
scripts require two arguments: (1) a deb of ua from this PR and (2) a deb of cloud-init that includes the changes in cloud-id: publish /run/cloud-init/cloud-id-<cloud-type> files cloud-init#1244Desired commit type
Checklist