-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor status.py #4864
Refactor status.py #4864
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @TheRealFalcon, huge maintainability improvements here - and a much better UI too. Thanks for doing this. I have a couple of minor change requests inline - mostly just comments and questions.
We have a couple of strings that still need to be updated in the docs, comments, and tests:
$ rg "degraded running"
cloudinit/cmd/status.py
165: # Handle the "degraded done" and "degraded running" states
doc/rtd/howto/status.rst
58: "degraded running"
$ rg "degraded done"
doc/rtd/howto/status.rst
43: "extended_status": "degraded done",
57: "degraded done"
doc/rtd/explanation/exported_errors.rst
27: "extended_status": "degraded done",
83: "extended_status": "degraded done",
cloudinit/cmd/status.py
165: # Handle the "degraded done" and "degraded running" states
tests/unittests/cmd/test_status.py
671: "extended_status": "degraded done",
tests/unittests/cmd/test_cloud_id.py
Outdated
@@ -32,8 +34,9 @@ | |||
{}, | |||
) | |||
STATUS_DETAILS_NOT_RUN = status.StatusDetails( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was formerly named after the UXAppStatus it contained, however we rename NOT_RUN
to NOT_STARTED
under RunningStatus
. Could we update this variable to also follow the rename?
i.e.
STATUS_DETAILS_NOT_RUN = status.StatusDetails( | |
STATUS_DETAILS_NOT_STARTED = status.StatusDetails( |
and the variable updated throughout?
description = "Failed due to systemd unit failure" | ||
errors.append( | ||
"Failed due to sysetmd unit failure. Ensure all cloud-init " | ||
"services are enabled, and check 'systemctl' or 'journalctl' " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which service to check is unknown to the user by this message, but we could easily return the service name that failed from systemd_failed()
for a more informative error message.
I'm not requesting this change now, but it's an idea for a future improvement we could make.
|
||
if ( | ||
running_status == RunningStatus.RUNNING | ||
and uses_systemd() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might eventually want to abstract systemd_failed()
into the distros/
directory (or maybe even an init-system-specific abstraction) so that other init systems can be better extended to check status. This isn't something we need to do today, just an idea for the future.
cloudinit/cmd/status.py
Outdated
status_file, result_file, boot_status_code, latest_event | ||
) -> RunningStatus: | ||
"""Return the running status of cloud-init.""" | ||
if is_running(status_file, result_file): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic normally should work, but I think it would be slightly safer to do if boot_status_code in DISABLED_BOOT_CODES
before is_running(status_file, result_file)
due to edge cases. This is because I think that boot_status_code
is a higher confidence signal than the file existence.
A system that experiences a cloud-init crash that leaves behind a status file but not a result file will return RunningStatus.RUNNING
(non-systemd). While this makes sense, if cloud-init is subsequently disabled we will still report RunningStatus.RUNNING
due to the existence of this file.
) -> str: | ||
"""Query systemd with retries and return output.""" | ||
while True: | ||
try: | ||
return subp.subp(["systemctl", *systemctl_args]).stdout.strip() | ||
except subp.ProcessExecutionError as e: | ||
if existing_status and existing_status in ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After this change, the message to stderr below might get printed once from both get_status_details()::get_bootstatus()
and again in get_status_details()::systemd_failed()
when cloud-init status
is called early in boot. Is this intentional? +1 either way, but it just seems unnecessary from a user's perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored it some to account for this. Let me know if that works for you
cloudinit/cmd/status.py
Outdated
""" | ||
# If we're done and have errors, we're in an error state | ||
if condition == ConditionStatus.ERROR: | ||
return ("error", f"{condition.value} - {running.value}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
style question: Was it intentional to use parenthesis on this line and the next return but not the final one? I personally prefer never using them when returning tuples, but I'm curious if there is some stylistic reason for doing it this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, just lack of conscious thought. I was also notorious at a previous company for mixing double quotes and single quotes in the same file/function, but thankfully black has made that a non-issue for me. 😄
I removed the parens from these.
The one major functional change in this commit is around how we detect running vs error states. Status reporting has a fundamental problem in that we can't accurately tell if cloud-init is done because cloud-init is actually several processes. There isn't always a way to tell whether a service isn't running because it simply hasn't started yet vs the service being blocked/crashed and will never start/finish. In the past, if any of the cloud-init services reported an error, we would assume that cloud-init as a whole has crashed and report that cloud-init is "done", but with error. This commit flips that logic to assume that cloud-init is always running unless we see indication that cloud-init has completely finished. This means that `cloud-init status --wait` may run forever if cloud-init has crashed or is blocked on another service. This is preferable to returning early and potentially allowing provisioning scripts that wait for cloud-init to continue provisioning. On systemd-enabled systems, there is extra logic to inspect the state of the services, so this should rarely be a problem in practice. Additionally, this commit includes the following refactoring: - Split UXAppStatus into RunningStatus and ConditionStatus so they can be tracked independently - Simplify the tabular printing - On error print extended_status as "error - done" or "error - running" - Add several helper functions in `get_status_details` to simplify logic - Rename `_get_error_or_running_from_systemd` to `systemd_failed` and only return if error is detected - Change "is running" logic to be determined solely by the existence of the status.json and results.json files.
befc8f6
to
04b62bf
Compare
Also force pushed a rebase |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @TheRealFalcon!
The one major functional change in this commit is around how we detect running vs error states. Status reporting has a fundamental problem in that we can't accurately tell if cloud-init is done because cloud-init is actually several processes. There isn't always a way to tell whether a service isn't running because it simply hasn't started yet vs the service being blocked/crashed and will never start/finish. In the past, if any of the cloud-init services reported an error, we would assume that cloud-init as a whole has crashed and report that cloud-init is "done", but with error. This commit flips that logic to assume that cloud-init is always running unless we see indication that cloud-init has completely finished. This means that `cloud-init status --wait` may run forever if cloud-init has crashed or is blocked on another service. This is preferable to returning early and potentially allowing provisioning scripts that wait for cloud-init to continue provisioning. On systemd-enabled systems, there is extra logic to inspect the state of the services, so this should rarely be a problem in practice. Additionally, this commit includes the following refactoring: - Split UXAppStatus into RunningStatus and ConditionStatus so they can be tracked independently - Simplify the tabular printing - On error print extended_status as "error - done" or "error - running" - Add several helper functions in `get_status_details` to simplify logic - Rename `_get_error_or_running_from_systemd` to `systemd_failed` and only return if error is detected - Change "is running" logic to be determined solely by the existence of the status.json and results.json files.
The one major functional change in this commit is around how we detect running vs error states. Status reporting has a fundamental problem in that we can't accurately tell if cloud-init is done because cloud-init is actually several processes. There isn't always a way to tell whether a service isn't running because it simply hasn't started yet vs the service being blocked/crashed and will never start/finish. In the past, if any of the cloud-init services reported an error, we would assume that cloud-init as a whole has crashed and report that cloud-init is "done", but with error. This commit flips that logic to assume that cloud-init is always running unless we see indication that cloud-init has completely finished. This means that `cloud-init status --wait` may run forever if cloud-init has crashed or is blocked on another service. This is preferable to returning early and potentially allowing provisioning scripts that wait for cloud-init to continue provisioning. On systemd-enabled systems, there is extra logic to inspect the state of the services, so this should rarely be a problem in practice. Additionally, this commit includes the following refactoring: - Split UXAppStatus into RunningStatus and ConditionStatus so they can be tracked independently - Simplify the tabular printing - On error print extended_status as "error - done" or "error - running" - Add several helper functions in `get_status_details` to simplify logic - Rename `_get_error_or_running_from_systemd` to `systemd_failed` and only return if error is detected - Change "is running" logic to be determined solely by the existence of the status.json and results.json files.
The one major functional change in this commit is around how we detect running vs error states. Status reporting has a fundamental problem in that we can't accurately tell if cloud-init is done because cloud-init is actually several processes. There isn't always a way to tell whether a service isn't running because it simply hasn't started yet vs the service being blocked/crashed and will never start/finish. In the past, if any of the cloud-init services reported an error, we would assume that cloud-init as a whole has crashed and report that cloud-init is "done", but with error. This commit flips that logic to assume that cloud-init is always running unless we see indication that cloud-init has completely finished. This means that `cloud-init status --wait` may run forever if cloud-init has crashed or is blocked on another service. This is preferable to returning early and potentially allowing provisioning scripts that wait for cloud-init to continue provisioning. On systemd-enabled systems, there is extra logic to inspect the state of the services, so this should rarely be a problem in practice. Additionally, this commit includes the following refactoring: - Split UXAppStatus into RunningStatus and ConditionStatus so they can be tracked independently - Simplify the tabular printing - On error print extended_status as "error - done" or "error - running" - Add several helper functions in `get_status_details` to simplify logic - Rename `_get_error_or_running_from_systemd` to `systemd_failed` and only return if error is detected - Change "is running" logic to be determined solely by the existence of the status.json and results.json files.
The one major functional change in this commit is around how we detect running vs error states. Status reporting has a fundamental problem in that we can't accurately tell if cloud-init is done because cloud-init is actually several processes. There isn't always a way to tell whether a service isn't running because it simply hasn't started yet vs the service being blocked/crashed and will never start/finish. In the past, if any of the cloud-init services reported an error, we would assume that cloud-init as a whole has crashed and report that cloud-init is "done", but with error. This commit flips that logic to assume that cloud-init is always running unless we see indication that cloud-init has completely finished. This means that `cloud-init status --wait` may run forever if cloud-init has crashed or is blocked on another service. This is preferable to returning early and potentially allowing provisioning scripts that wait for cloud-init to continue provisioning. On systemd-enabled systems, there is extra logic to inspect the state of the services, so this should rarely be a problem in practice. Additionally, this commit includes the following refactoring: - Split UXAppStatus into RunningStatus and ConditionStatus so they can be tracked independently - Simplify the tabular printing - On error print extended_status as "error - done" or "error - running" - Add several helper functions in `get_status_details` to simplify logic - Rename `_get_error_or_running_from_systemd` to `systemd_failed` and only return if error is detected - Change "is running" logic to be determined solely by the existence of the status.json and results.json files.
Proposed Commit Message
Additional Context
Test Steps
Checklist
Merge type