Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve logging and test around systemd fallback logic #1683

Merged
merged 2 commits into from
Oct 25, 2019

Conversation

pgombar
Copy link
Contributor

@pgombar pgombar commented Oct 25, 2019

Description

Explicitly log when systemd extension invocation failed and we are about to attempt falling back to a regular subprocess.Popen. Also, a test case is added that mimics a systemd timeout.


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • Except for special cases involving multiple contributors, the PR is started from a fork of the main repository, not a branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made and Travis.CI is passing.

Quality of Code and Contribution Guidelines


This change is Reviewable

@codecov
Copy link

codecov bot commented Oct 25, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@081604b). Click here to learn what that means.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop    #1683   +/-   ##
==========================================
  Coverage           ?   67.35%           
==========================================
  Files              ?       80           
  Lines              ?    11435           
  Branches           ?     1605           
==========================================
  Hits               ?     7702           
  Misses             ?     3393           
  Partials           ?      340
Impacted Files Coverage Δ
azurelinuxagent/common/cgroupapi.py 83.01% <100%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 081604b...5b3117e. Read the comment docs.

@@ -510,6 +510,8 @@ def create_cgroup(controller):
stderr.truncate(0)

# Try invoking the process again, this time without systemd-run
logger.info('Extension invocation using systemd failed, falling back to regular invocation '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're already logging systemd failure (L502), why do we need another info log for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it more clear from the logs that we're in the fallback logic branch of this if; the logs I was examining yesterday were pretty confusing and I couldn't tell if we were firing it or not without going through the code.

stderr=stderr)

self.assertEquals(extension_cgroups, [])
self.assertEquals(expected_output.format("success"), process_output)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we assert that first we tried with systemd, and then without it?

other than that LGTM, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that, thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup. Same comment. Checking call_count for both log.info as well as popen check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think checking Popen's call count and arguments are strong enough, checking log.info's call count seems too dependent on the implementation of something that's not crucial to the scenario.

self.assertEquals(expected_output.format("success"), process_output)

@patch('time.sleep', side_effect=lambda _: mock_sleep(0.001))
def test_start_extension_command_should_use_fallback_option_if_systemd_times_out_externally(self, _):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, what is the difference of the new test with this one? (test_start_extension_command_should_use_fallback_option_if_systemd_times_out_externally)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _internally method mimics what happens when systemd times out itself, the _externally method mimics what happens when we time it out. I wanted to keep both since we're not in control of the timeout threshold of systemd and it might be larger or smaller than how we define our extension timeout.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe when we time it out.? If we are timing out, then where are we validating it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After inspecting logs where systemd reported "Connection timed out", it's clear that its internal timeout is significantly shorter than what we define for extension operations (~1 minute versus 5 minutes for extensions), so I'm going to drop the _externally test because:

  1. it's currently broken (prompts for sudo but then succeeds even without sudo)
  2. it doesn't test anything of value (we are not actually causing systemd to timeout/hang before starting the extension operation, but mocking it).

The _internally test (which I will rename to just systemd_times_out) covers this scenario.

# extension operations. When systemd times out, it will write that to stderr and exit with exit code 1.
# In that case, we will internally recognize the failure due to the non-zero exit code, not as a timeout.
original_popen = subprocess.Popen
systemd_timeout_command = "echo 'Failed to start transient scope unit: Connection timed out' >&2 && exit 1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I was going through the systemd fallback code and I noticed something, it doesnt matter if its a systemd timeout or a systemd execution failure (exit non-0), the exception is handled the same way. If that's the case then any particular reason why the tests are named 'systemd_times_out'?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementation vs behavior :)

though the implementation currently handles errors the same (whether they are time outs or not) we want to assert the behavior of important scenarios, timeout being one of them.

if the implementation changes, we want the behavior to remain the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh nice, I get the motivation behind writing the test.
1 follow up though, if we're checking the behavior then shouldn't we be testing the behavior? Like for example this test has exit 1 to fail the test which is not necessarily a systemd timeout. If we're validation a scenario then shouldn't that actually mock systemd timeout?

Copy link
Member

@narrieta narrieta Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, absolutely. When systemd times out it exits with 1, no stdout and a specific message in stderr. The mock replicates this behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh very interesting, thanks for sharing! This is an interesting systemd behavior I was unaware of! :)

Copy link
Member

@narrieta narrieta Oct 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One may argue that the mock should also sleep for the timeout period, but that would not fit with a unit test, I think (too slow). An integration test may be worth considering.

Copy link
Contributor

@larohra larohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

self.assertEquals(extension_cgroups, [])
self.assertEquals(expected_output.format("success"), process_output)
# We expect two calls to Popen, first for the systemd-run call, second for the fallback option
self.assertEquals(2, patch_mock_popen.call_count)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can have a stronger check if you assert that the arguments to the first call are wrapped with systemd-run and the arguments for the second call are not.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is done just 2 lines after my comment

oh, my

LGTM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy friday, Norberto! :D

@pgombar pgombar merged commit ddc3f22 into Azure:develop Oct 25, 2019
@pgombar pgombar deleted the improve_systemd_fallback branch October 25, 2019 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants