Refactor status.py #4864

TheRealFalcon · 2024-02-08T14:59:20Z

Proposed Commit Message

refactor: Refactor status.py

The one major functional change in this commit is around how we detect
running vs error states. Status reporting has a fundamental problem in
that we can't accurately tell if cloud-init is done because cloud-init
is actually several processes. There isn't always a way to tell whether
a service isn't running because it simply hasn't started yet or because
it is blocked/crashed and will never start/finish.

In the past, if any of the cloud-init services reported an error, we
would assume that cloud-init as a whole has crashed and report that
cloud-init is "done", but with error. This commit flips that logic to
assume that cloud-init is always running unless we see indication that
cloud-init has completely finished. This means that
`cloud-init status --wait` may run forever if cloud-init has crashed or
is blocked on another service. This is preferable to returning early
and letting provisioning scripts that wait for cloud-init proceed
before it has actually finished. On systemd-enabled systems, there is extra
logic to inspect the state of the services, so this should rarely be a
problem in practice.

Additionally, this commit includes the following refactoring:
- Split UXAppStatus into RunningStatus and ConditionStatus so they can
  be tracked independently
- Simplify the tabular printing
- On error print extended_status as "error - done" or "error - running"
- Add several helper functions in `get_status_details` to simplify
  logic
- Rename `_get_error_or_running_from_systemd` to `systemd_failed`
  and only return if error is detected
- Change "is running" logic to be determined solely by the existence
  of the status.json and result.json files.
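
As a rough illustration of the split described above, the following minimal sketch shows two independent enums and a file-based "is running" check. Only `NOT_STARTED`, `RUNNING`, and `ERROR` are named in this PR; the other members, the string values, and `HEALTHY` are assumptions made for the example, not the merged code.

```python
import enum
import os


class RunningStatus(enum.Enum):
    """Where cloud-init is in its lifecycle."""

    NOT_STARTED = "not started"  # string value is an assumption
    RUNNING = "running"
    DONE = "done"
    DISABLED = "disabled"  # assumed member


class ConditionStatus(enum.Enum):
    """Health of the run, tracked independently of RunningStatus."""

    ERROR = "error"
    DEGRADED = "degraded"  # assumed member
    HEALTHY = "healthy"  # hypothetical name for the no-problem case


def is_running(status_file: str, result_file: str) -> bool:
    """Running means the status file exists but the result file does not yet."""
    return os.path.exists(status_file) and not os.path.exists(result_file)
```

Because the check depends only on the two files, a crash that leaves the status file behind without a result file keeps reporting "running", which is the trade-off the commit message describes.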

Additional Context

Test Steps

Checklist

My code follows the process laid out in the documentation
I have updated or added any unit tests accordingly
I have updated or added any documentation accordingly

Merge type

Squash merge using "Proposed Commit Message"
Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

holmanb

Thanks @TheRealFalcon, huge maintainability improvements here - and a much better UI too. Thanks for doing this. I have a couple of minor change requests inline - mostly just comments and questions.

We have a couple of strings that still need to be updated in the docs, comments, and tests:

$ rg "degraded running"
cloudinit/cmd/status.py
165:    # Handle the "degraded done" and "degraded running" states

doc/rtd/howto/status.rst
58:    "degraded running"

$ rg "degraded done"
doc/rtd/howto/status.rst
43:      "extended_status": "degraded done",
57:    "degraded done"

doc/rtd/explanation/exported_errors.rst
27:      "extended_status": "degraded done",
83:      "extended_status": "degraded done",

cloudinit/cmd/status.py
165:    # Handle the "degraded done" and "degraded running" states

tests/unittests/cmd/test_status.py
671:                    "extended_status": "degraded done",

holmanb · 2024-02-20T22:44:39Z

tests/unittests/cmd/test_cloud_id.py

@@ -32,8 +34,9 @@
    {},
 )
 STATUS_DETAILS_NOT_RUN = status.StatusDetails(


This was formerly named after the UXAppStatus it contained; however, this PR renames NOT_RUN to NOT_STARTED under RunningStatus. Could we update this variable to follow the rename as well?

i.e.

Suggested change

-STATUS_DETAILS_NOT_RUN = status.StatusDetails(
+STATUS_DETAILS_NOT_STARTED = status.StatusDetails(

and the variable updated throughout?

holmanb · 2024-02-20T23:01:00Z

cloudinit/cmd/status.py

+        description = "Failed due to systemd unit failure"
+        errors.append(
+            "Failed due to systemd unit failure. Ensure all cloud-init "
+            "services are enabled, and check 'systemctl' or 'journalctl' "


This message doesn't tell the user which service to check, but we could easily return the name of the failed service from `systemd_failed()` to produce a more informative error message.

I'm not requesting this change now, but it's an idea for a future improvement we could make.
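
One possible shape for that future improvement is sketched below. It assumes the caller has already collected each unit's systemd `ActiveState`; the helper names `first_failed_unit` and `describe_failure` are hypothetical and are not part of this PR.

```python
from typing import Mapping, Optional


def first_failed_unit(unit_states: Mapping[str, str]) -> Optional[str]:
    """Return the first unit whose ActiveState is "failed", else None.

    unit_states maps a unit name to its ActiveState string (assumed input).
    """
    for unit, active_state in unit_states.items():
        if active_state == "failed":
            return unit
    return None


def describe_failure(unit_states: Mapping[str, str]) -> Optional[str]:
    """Build an error message that names the failed unit (hypothetical)."""
    unit = first_failed_unit(unit_states)
    if unit is None:
        return None
    return (
        f"Failed due to systemd unit failure in {unit}. "
        f"Check 'journalctl -u {unit}' for more information."
    )
```

For example, `describe_failure({"cloud-init-local.service": "active", "cloud-final.service": "failed"})` would name `cloud-final.service` in the message.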

holmanb · 2024-02-20T23:05:16Z

cloudinit/cmd/status.py

+
+    if (
+        running_status == RunningStatus.RUNNING
+        and uses_systemd()


We might eventually want to abstract systemd_failed() into the distros/ directory (or maybe even an init-system-specific abstraction) so that other init systems can be better extended to check status. This isn't something we need to do today, just an idea for the future.

holmanb · 2024-02-20T23:14:43Z

cloudinit/cmd/status.py

+    status_file, result_file, boot_status_code, latest_event
+) -> RunningStatus:
+    """Return the running status of cloud-init."""
+    if is_running(status_file, result_file):


This logic should normally work, but I think it would be slightly safer to check `boot_status_code in DISABLED_BOOT_CODES` before `is_running(status_file, result_file)` to cover edge cases, because `boot_status_code` is a higher-confidence signal than file existence.

A system that experiences a cloud-init crash that leaves behind a status file but not a result file will return RunningStatus.RUNNING (non-systemd). While this makes sense, if cloud-init is subsequently disabled we will still report RunningStatus.RUNNING due to the existence of this file.
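
The suggested ordering could look roughly like the sketch below. The contents of `DISABLED_BOOT_CODES`, the enum values, and the use of `latest_event > 0` as evidence that cloud-init has run are all placeholders assumed for illustration; only the order of the first two checks reflects the suggestion.

```python
import enum
import os


class RunningStatus(enum.Enum):
    NOT_STARTED = "not started"  # values here are assumptions
    RUNNING = "running"
    DONE = "done"
    DISABLED = "disabled"


# Placeholder codes; the real set lives in cloudinit/cmd/status.py.
DISABLED_BOOT_CODES = frozenset({"disabled-by-marker-file"})


def is_running(status_file: str, result_file: str) -> bool:
    """Status file present and result file absent."""
    return os.path.exists(status_file) and not os.path.exists(result_file)


def get_running_status(
    status_file: str, result_file: str, boot_status_code: str, latest_event: float
) -> RunningStatus:
    """Return the running status, trusting the boot status code first."""
    # Checked first so a stale status file cannot mask a disabled system.
    if boot_status_code in DISABLED_BOOT_CODES:
        return RunningStatus.DISABLED
    if is_running(status_file, result_file):
        return RunningStatus.RUNNING
    if latest_event > 0:  # assumption: a recorded event means a run finished
        return RunningStatus.DONE
    return RunningStatus.NOT_STARTED
```

With this order, a leftover status file from an earlier crash no longer produces `RunningStatus.RUNNING` once cloud-init has been disabled.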

holmanb · 2024-02-20T23:36:37Z

cloudinit/cmd/status.py

 ) -> str:
    """Query systemd with retries and return output."""
    while True:
        try:
            return subp.subp(["systemctl", *systemctl_args]).stdout.strip()
        except subp.ProcessExecutionError as e:
-            if existing_status and existing_status in (


After this change, the message to stderr below might be printed twice when `cloud-init status` is called early in boot: once from get_status_details()::get_bootstatus() and again from get_status_details()::systemd_failed(). Is this intentional? +1 either way, but it seems unnecessary from a user's perspective.

TheRealFalcon · Feb 22, 2024

I refactored it a bit to account for this. Let me know if that works for you.

holmanb · 2024-02-20T23:42:50Z

cloudinit/cmd/status.py

+    """
+    # If we're done and have errors, we're in an error state
+    if condition == ConditionStatus.ERROR:
+        return ("error", f"{condition.value} - {running.value}")


Style question: was it intentional to use parentheses on this line and the next return but not the final one? I personally prefer never using them when returning tuples, but I'm curious whether there is a stylistic reason for doing it this way.

TheRealFalcon · Feb 22, 2024

Nope, just lack of conscious thought. I was also notorious at a previous company for mixing double quotes and single quotes in the same file/function, but thankfully black has made that a non-issue for me. 😄

I removed the parens from these.
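
For reference, a minimal sketch of the error branch with the parentheses dropped is shown below. The function name `translate_status`, the enum values, and the fall-through return are assumptions for the example; handling of degraded states is deliberately omitted.

```python
import enum
from typing import Tuple


class RunningStatus(enum.Enum):
    RUNNING = "running"
    DONE = "done"


class ConditionStatus(enum.Enum):
    ERROR = "error"
    HEALTHY = "healthy"  # hypothetical name for the no-problem case


def translate_status(
    running: RunningStatus, condition: ConditionStatus
) -> Tuple[str, str]:
    """Return (status, extended_status) for display (hypothetical helper)."""
    # On error, report "error" plus "error - done" or "error - running".
    if condition == ConditionStatus.ERROR:
        return "error", f"{condition.value} - {running.value}"
    # Assumed fall-through: report the running state for both fields.
    return running.value, running.value
```

Calling `translate_status(RunningStatus.RUNNING, ConditionStatus.ERROR)` yields `("error", "error - running")`, matching the extended_status strings listed in the commit message.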

TheRealFalcon · 2024-02-22T03:10:51Z

Also force-pushed a rebase.

holmanb

LGTM, thanks @TheRealFalcon!

TheRealFalcon force-pushed the status branch from a707f8d to 014cc4f Compare February 8, 2024 15:13

holmanb self-assigned this Feb 12, 2024

holmanb added the 24.1 label Feb 13, 2024

holmanb requested changes Feb 20, 2024

TheRealFalcon added 2 commits February 21, 2024 21:08

comments

04b62bf

TheRealFalcon force-pushed the status branch from befc8f6 to 04b62bf Compare February 22, 2024 03:10

holmanb approved these changes Feb 23, 2024

TheRealFalcon merged commit d175170 into canonical:main Feb 24, 2024
29 checks passed

TheRealFalcon deleted the status branch February 24, 2024 04:49

TheRealFalcon mentioned this pull request Aug 7, 2024

Cloud-init status always reporting running #5304

Closed
