-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve logging and test around systemd fallback logic #1683
Conversation
…extension invocation
Codecov Report
@@ Coverage Diff @@
## develop #1683 +/- ##
==========================================
Coverage ? 67.35%
==========================================
Files ? 80
Lines ? 11435
Branches ? 1605
==========================================
Hits ? 7702
Misses ? 3393
Partials ? 340
Continue to review full report at Codecov.
|
@@ -510,6 +510,8 @@ def create_cgroup(controller): | |||
stderr.truncate(0) | |||
|
|||
# Try invoking the process again, this time without systemd-run | |||
logger.info('Extension invocation using systemd failed, falling back to regular invocation ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're already logging systemd failure (L502), why do we need another info log for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make it more clear from the logs that we're in the fallback logic branch of this if; the logs I was examining yesterday were pretty confusing and I couldn't tell if we were firing it or not without going through the code.
tests/common/test_cgroupapi.py
Outdated
stderr=stderr) | ||
|
||
self.assertEquals(extension_cgroups, []) | ||
self.assertEquals(expected_output.format("success"), process_output) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we assert that first we tried with systemd, and then without it?
other than that LGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add that, thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup. Same comment. Checking call_count for both log.info as well as popen check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think checking Popen's call count and arguments are strong enough, checking log.info's call count seems too dependent on the implementation of something that's not crucial to the scenario.
tests/common/test_cgroupapi.py
Outdated
self.assertEquals(expected_output.format("success"), process_output) | ||
|
||
@patch('time.sleep', side_effect=lambda _: mock_sleep(0.001)) | ||
def test_start_extension_command_should_use_fallback_option_if_systemd_times_out_externally(self, _): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, what is the difference of the new test with this one? (test_start_extension_command_should_use_fallback_option_if_systemd_times_out_externally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _internally
method mimics what happens when systemd times out itself, the _externally
method mimics what happens when we time it out. I wanted to keep both since we're not in control of the timeout threshold of systemd and it might be larger or smaller than how we define our extension timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you describe when we time it out.
? If we are timing out, then where are we validating it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After inspecting logs where systemd reported "Connection timed out", it's clear that its internal timeout is significantly shorter than what we define for extension operations (~1 minute versus 5 minutes for extensions), so I'm going to drop the _externally
test because:
- it's currently broken (prompts for sudo but then succeeds even without sudo)
- it doesn't test anything of value (we are not actually causing systemd to timeout/hang before starting the extension operation, but mocking it).
The _internally
test (which I will rename to just systemd_times_out
) covers this scenario.
# extension operations. When systemd times out, it will write that to stderr and exit with exit code 1. | ||
# In that case, we will internally recognize the failure due to the non-zero exit code, not as a timeout. | ||
original_popen = subprocess.Popen | ||
systemd_timeout_command = "echo 'Failed to start transient scope unit: Connection timed out' >&2 && exit 1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I was going through the systemd fallback code and I noticed something, it doesnt matter if its a systemd timeout or a systemd execution failure (exit non-0), the exception is handled the same way. If that's the case then any particular reason why the tests are named 'systemd_times_out'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implementation vs behavior :)
though the implementation currently handles errors the same (whether they are time outs or not) we want to assert the behavior of important scenarios, timeout being one of them.
if the implementation changes, we want the behavior to remain the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ohh nice, I get the motivation behind writing the test.
1 follow up though, if we're checking the behavior then shouldn't we be testing the behavior? Like for example this test has exit 1
to fail the test which is not necessarily a systemd timeout. If we're validation a scenario then shouldn't that actually mock systemd timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, absolutely. When systemd times out it exits with 1, no stdout and a specific message in stderr. The mock replicates this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh very interesting, thanks for sharing! This is an interesting systemd behavior I was unaware of! :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One may argue that the mock should also sleep for the timeout period, but that would not fit with a unit test, I think (too slow). An integration test may be worth considering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
self.assertEquals(extension_cgroups, []) | ||
self.assertEquals(expected_output.format("success"), process_output) | ||
# We expect two calls to Popen, first for the systemd-run call, second for the fallback option | ||
self.assertEquals(2, patch_mock_popen.call_count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can have a stronger check if you assert that the arguments to the first call are wrapped with systemd-run and the arguments for the second call are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which is done just 2 lines after my comment
oh, my
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy friday, Norberto! :D
Description
Explicitly log when systemd extension invocation failed and we are about to attempt falling back to a regular subprocess.Popen. Also, a test case is added that mimics a systemd timeout.
PR information
Quality of Code and Contribution Guidelines
This change is