Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor status.py #4864

Merged
merged 2 commits into from
Feb 24, 2024
Merged

Refactor status.py #4864

merged 2 commits into from
Feb 24, 2024

Conversation

TheRealFalcon
Copy link
Member

Proposed Commit Message

refactor: Refactor status.py

The one major functional change in this commit is around how we detect
running vs error states. Status reporting has a fundamental problem in
that we can't accurately tell if cloud-init is done because cloud-init
is actually several processes. There isn't always a way to tell whether
a service isn't running because it simply hasn't started yet vs the
service being blocked/crashed and will never start/finish.

In the past, if any of the cloud-init services reported an error, we
would assume that cloud-init as a whole has crashed and report that
cloud-init is "done", but with error. This commit flips that logic to
assume that cloud-init is always running unless we see indication that
cloud-init has completely finished. This means that
`cloud-init status --wait` may run forever if cloud-init has crashed or
is blocked on another service. This is preferable to returning early
and potentially allowing provisioning scripts that wait for cloud-init
to continue provisioning. On systemd-enabled systems, there is extra
logic to inspect the state of the services, so this should rarely be a
problem in practice.

Additionally, this commit includes the following refactoring:
- Split UXAppStatus into RunningStatus and ConditionStatus so they can
  be tracked independently
- Simplify the tabular printing
- On error print extended_status as "error - done" or "error - running"
- Add several helper functions in `get_status_details` to simplify
  logic
- Rename `_get_error_or_running_from_systemd` to `systemd_failed`
  and only return if error is detected
- Change "is running" logic to be determined solely by the existence
  of the status.json and results.json files.

Additional Context

Test Steps

Checklist

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

Copy link
Member

@holmanb holmanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @TheRealFalcon, huge maintainability improvements here - and a much better UI too. Thanks for doing this. I have a couple of minor change requests inline - mostly just comments and questions.

We have a couple of strings that still need to be updated in the docs, comments, and tests:

$ rg "degraded running"
cloudinit/cmd/status.py
165:    # Handle the "degraded done" and "degraded running" states

doc/rtd/howto/status.rst
58:    "degraded running"

$ rg "degraded done"
doc/rtd/howto/status.rst
43:      "extended_status": "degraded done",
57:    "degraded done"

doc/rtd/explanation/exported_errors.rst
27:      "extended_status": "degraded done",
83:      "extended_status": "degraded done",

cloudinit/cmd/status.py
165:    # Handle the "degraded done" and "degraded running" states

tests/unittests/cmd/test_status.py
671:                    "extended_status": "degraded done",

@@ -32,8 +34,9 @@
{},
)
STATUS_DETAILS_NOT_RUN = status.StatusDetails(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was formerly named after the UXAppStatus it contained, however we rename NOT_RUN to NOT_STARTED under RunningStatus. Could we update this variable to also follow the rename?

i.e.

Suggested change
STATUS_DETAILS_NOT_RUN = status.StatusDetails(
STATUS_DETAILS_NOT_STARTED = status.StatusDetails(

and the variable updated throughout?

cloudinit/cmd/status.py Show resolved Hide resolved
description = "Failed due to systemd unit failure"
errors.append(
"Failed due to sysetmd unit failure. Ensure all cloud-init "
"services are enabled, and check 'systemctl' or 'journalctl' "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which service to check is unknown to the user by this message, but we could easily return the service name that failed from systemd_failed() for a more informative error message.

I'm not requesting this change now, but it's an idea for a future improvement we could make.


if (
running_status == RunningStatus.RUNNING
and uses_systemd()
Copy link
Member

@holmanb holmanb Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might eventually want to abstract systemd_failed() into the distros/ directory (or maybe even an init-system-specific abstraction) so that other init systems can be better extended to check status. This isn't something we need to do today, just an idea for the future.

status_file, result_file, boot_status_code, latest_event
) -> RunningStatus:
"""Return the running status of cloud-init."""
if is_running(status_file, result_file):
Copy link
Member

@holmanb holmanb Feb 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic normally should work, but I think it would be slightly safer to do if boot_status_code in DISABLED_BOOT_CODES before is_running(status_file, result_file) due to edge cases. This is because I think that boot_status_code is a higher confidence signal than the file existence.

A system that experiences a cloud-init crash that leaves behind a status file but not a result file will return RunningStatus.RUNNING (non-systemd). While this makes sense, if cloud-init is subsequently disabled we will still report RunningStatus.RUNNING due to the existence of this file.

cloudinit/cmd/status.py Show resolved Hide resolved
cloudinit/cmd/status.py Outdated Show resolved Hide resolved
) -> str:
"""Query systemd with retries and return output."""
while True:
try:
return subp.subp(["systemctl", *systemctl_args]).stdout.strip()
except subp.ProcessExecutionError as e:
if existing_status and existing_status in (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this change, the message to stderr below might get printed once from both get_status_details()::get_bootstatus() and again in get_status_details()::systemd_failed() when cloud-init status is called early in boot. Is this intentional? +1 either way, but it just seems unnecessary from a user's perspective.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored it some to account for this. Let me know if that works for you

cloudinit/cmd/status.py Show resolved Hide resolved
"""
# If we're done and have errors, we're in an error state
if condition == ConditionStatus.ERROR:
return ("error", f"{condition.value} - {running.value}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style question: Was it intentional to use parenthesis on this line and the next return but not the final one? I personally prefer never using them when returning tuples, but I'm curious if there is some stylistic reason for doing it this way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, just lack of conscious thought. I was also notorious at a previous company for mixing double quotes and single quotes in the same file/function, but thankfully black has made that a non-issue for me. 😄

I removed the parens from these.

The one major functional change in this commit is around how we detect
running vs error states. Status reporting has a fundamental problem in
that we can't accurately tell if cloud-init is done because cloud-init
is actually several processes. There isn't always a way to tell whether
a service isn't running because it simply hasn't started yet vs the
service being blocked/crashed and will never start/finish.

In the past, if any of the cloud-init services reported an error, we
would assume that cloud-init as a whole has crashed and report that
cloud-init is "done", but with error. This commit flips that logic to
assume that cloud-init is always running unless we see indication that
cloud-init has completely finished. This means that
`cloud-init status --wait` may run forever if cloud-init has crashed or
is blocked on another service. This is preferable to returning early
and potentially allowing provisioning scripts that wait for cloud-init
to continue provisioning. On systemd-enabled systems, there is extra
logic to inspect the state of the services, so this should rarely be a
problem in practice.

Additionally, this commit includes the following refactoring:
- Split UXAppStatus into RunningStatus and ConditionStatus so they can
  be tracked independently
- Simplify the tabular printing
- On error print extended_status as "error - done" or "error - running"
- Add several helper functions in `get_status_details` to simplify
  logic
- Rename `_get_error_or_running_from_systemd` to `systemd_failed`
  and only return if error is detected
- Change "is running" logic to be determined solely by the existence
  of the status.json and results.json files.
@TheRealFalcon
Copy link
Member Author

Also force pushed a rebase

Copy link
Member

@holmanb holmanb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @TheRealFalcon!

@TheRealFalcon TheRealFalcon merged commit d175170 into canonical:main Feb 24, 2024
29 checks passed
@TheRealFalcon TheRealFalcon deleted the status branch February 24, 2024 04:49
blackboxsw pushed a commit that referenced this pull request Feb 27, 2024
The one major functional change in this commit is around how we detect
running vs error states. Status reporting has a fundamental problem in
that we can't accurately tell if cloud-init is done because cloud-init
is actually several processes. There isn't always a way to tell whether
a service isn't running because it simply hasn't started yet vs the
service being blocked/crashed and will never start/finish.

In the past, if any of the cloud-init services reported an error, we
would assume that cloud-init as a whole has crashed and report that
cloud-init is "done", but with error. This commit flips that logic to
assume that cloud-init is always running unless we see indication that
cloud-init has completely finished. This means that
`cloud-init status --wait` may run forever if cloud-init has crashed or
is blocked on another service. This is preferable to returning early
and potentially allowing provisioning scripts that wait for cloud-init
to continue provisioning. On systemd-enabled systems, there is extra
logic to inspect the state of the services, so this should rarely be a
problem in practice.

Additionally, this commit includes the following refactoring:
- Split UXAppStatus into RunningStatus and ConditionStatus so they can
  be tracked independently
- Simplify the tabular printing
- On error print extended_status as "error - done" or "error - running"
- Add several helper functions in `get_status_details` to simplify
  logic
- Rename `_get_error_or_running_from_systemd` to `systemd_failed`
  and only return if error is detected
- Change "is running" logic to be determined solely by the existence
  of the status.json and results.json files.
blackboxsw pushed a commit that referenced this pull request Feb 27, 2024
The one major functional change in this commit is around how we detect
running vs error states. Status reporting has a fundamental problem in
that we can't accurately tell if cloud-init is done because cloud-init
is actually several processes. There isn't always a way to tell whether
a service isn't running because it simply hasn't started yet vs the
service being blocked/crashed and will never start/finish.

In the past, if any of the cloud-init services reported an error, we
would assume that cloud-init as a whole has crashed and report that
cloud-init is "done", but with error. This commit flips that logic to
assume that cloud-init is always running unless we see indication that
cloud-init has completely finished. This means that
`cloud-init status --wait` may run forever if cloud-init has crashed or
is blocked on another service. This is preferable to returning early
and potentially allowing provisioning scripts that wait for cloud-init
to continue provisioning. On systemd-enabled systems, there is extra
logic to inspect the state of the services, so this should rarely be a
problem in practice.

Additionally, this commit includes the following refactoring:
- Split UXAppStatus into RunningStatus and ConditionStatus so they can
  be tracked independently
- Simplify the tabular printing
- On error print extended_status as "error - done" or "error - running"
- Add several helper functions in `get_status_details` to simplify
  logic
- Rename `_get_error_or_running_from_systemd` to `systemd_failed`
  and only return if error is detected
- Change "is running" logic to be determined solely by the existence
  of the status.json and results.json files.
blackboxsw pushed a commit that referenced this pull request Feb 27, 2024
The one major functional change in this commit is around how we detect
running vs error states. Status reporting has a fundamental problem in
that we can't accurately tell if cloud-init is done because cloud-init
is actually several processes. There isn't always a way to tell whether
a service isn't running because it simply hasn't started yet vs the
service being blocked/crashed and will never start/finish.

In the past, if any of the cloud-init services reported an error, we
would assume that cloud-init as a whole has crashed and report that
cloud-init is "done", but with error. This commit flips that logic to
assume that cloud-init is always running unless we see indication that
cloud-init has completely finished. This means that
`cloud-init status --wait` may run forever if cloud-init has crashed or
is blocked on another service. This is preferable to returning early
and potentially allowing provisioning scripts that wait for cloud-init
to continue provisioning. On systemd-enabled systems, there is extra
logic to inspect the state of the services, so this should rarely be a
problem in practice.

Additionally, this commit includes the following refactoring:
- Split UXAppStatus into RunningStatus and ConditionStatus so they can
  be tracked independently
- Simplify the tabular printing
- On error print extended_status as "error - done" or "error - running"
- Add several helper functions in `get_status_details` to simplify
  logic
- Rename `_get_error_or_running_from_systemd` to `systemd_failed`
  and only return if error is detected
- Change "is running" logic to be determined solely by the existence
  of the status.json and results.json files.
holmanb pushed a commit to holmanb/cloud-init that referenced this pull request Mar 5, 2024
The one major functional change in this commit is around how we detect
running vs error states. Status reporting has a fundamental problem in
that we can't accurately tell if cloud-init is done because cloud-init
is actually several processes. There isn't always a way to tell whether
a service isn't running because it simply hasn't started yet vs the
service being blocked/crashed and will never start/finish.

In the past, if any of the cloud-init services reported an error, we
would assume that cloud-init as a whole has crashed and report that
cloud-init is "done", but with error. This commit flips that logic to
assume that cloud-init is always running unless we see indication that
cloud-init has completely finished. This means that
`cloud-init status --wait` may run forever if cloud-init has crashed or
is blocked on another service. This is preferable to returning early
and potentially allowing provisioning scripts that wait for cloud-init
to continue provisioning. On systemd-enabled systems, there is extra
logic to inspect the state of the services, so this should rarely be a
problem in practice.

Additionally, this commit includes the following refactoring:
- Split UXAppStatus into RunningStatus and ConditionStatus so they can
  be tracked independently
- Simplify the tabular printing
- On error print extended_status as "error - done" or "error - running"
- Add several helper functions in `get_status_details` to simplify
  logic
- Rename `_get_error_or_running_from_systemd` to `systemd_failed`
  and only return if error is detected
- Change "is running" logic to be determined solely by the existence
  of the status.json and results.json files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants