Improve logging and test around systemd fallback logic #1683

pgombar · 2019-10-25T18:16:09Z

Description

Explicitly log when the systemd extension invocation fails and we are about to fall back to a regular subprocess.Popen. Also adds a test case that mimics a systemd timeout.

PR information

The title of the PR is clear and informative.
There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
Except for special cases involving multiple contributors, the PR is started from a fork of the main repository, not a branch.
If applicable, the PR references the bug/issue that it fixes in the description.
New Unit tests were added for the changes made and Travis.CI is passing.

Quality of Code and Contribution Guidelines

I have read the contribution guidelines.

codecov · 2019-10-25T18:18:35Z

Codecov Report

❗ No coverage uploaded for pull request base (develop@081604b).
The diff coverage is 100%.

@@            Coverage Diff             @@
##             develop    #1683   +/-   ##
==========================================
  Coverage           ?   67.35%           
==========================================
  Files              ?       80           
  Lines              ?    11435           
  Branches           ?     1605           
==========================================
  Hits               ?     7702           
  Misses             ?     3393           
  Partials           ?      340

Impacted Files	Coverage Δ
azurelinuxagent/common/cgroupapi.py	`83.01% <100%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 081604b...5b3117e.

larohra · 2019-10-25T18:21:25Z

azurelinuxagent/common/cgroupapi.py

@@ -510,6 +510,8 @@ def create_cgroup(controller):
                stderr.truncate(0)

                # Try invoking the process again, this time without systemd-run
+                logger.info('Extension invocation using systemd failed, falling back to regular invocation '


We're already logging systemd failure (L502), why do we need another info log for it?

To make it more clear from the logs that we're in the fallback logic branch of this if; the logs I was examining yesterday were pretty confusing and I couldn't tell if we were firing it or not without going through the code.
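
The branch being discussed can be illustrated with a minimal sketch. This is not the agent's code: `run_with_fallback`, the `wrapper` parameter, and the use of the stderr marker to recognize a wrapper failure are all assumptions made for illustration, and the log message is abbreviated from the diff above.

```python
import logging
import subprocess
import sys

logger = logging.getLogger("fallback_sketch")

# Assumption: the sketch recognizes a wrapper failure by this stderr marker.
SYSTEMD_FAILURE_MARKER = "Failed to start transient scope unit"


def run_with_fallback(argv, wrapper=("systemd-run", "--scope")):
    """Run argv under wrapper; if the wrapper itself fails, log and retry without it.

    Returns a (stdout, used_fallback) tuple. Hypothetical helper, not the agent's API.
    """
    try:
        first = subprocess.run(list(wrapper) + list(argv), capture_output=True, text=True)
        wrapper_failed = first.returncode != 0 and SYSTEMD_FAILURE_MARKER in first.stderr
    except OSError:
        # The wrapper binary could not be launched at all.
        first, wrapper_failed = None, True

    if not wrapper_failed:
        # A failure of the command itself is reported, not retried.
        first.check_returncode()
        return first.stdout, False

    # The explicit log line makes the fallback branch visible in the logs.
    logger.info("Extension invocation using systemd failed, falling back to regular invocation")
    second = subprocess.run(list(argv), capture_output=True, text=True, check=True)
    return second.stdout, True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(run_with_fallback([sys.executable, "-c", "print('success')"]))
```

With the info line in place, a log reader can tell that the second invocation happened without reading the code.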

narrieta · 2019-10-25T18:23:29Z

tests/common/test_cgroupapi.py

+                        stderr=stderr)
+
+                    self.assertEquals(extension_cgroups, [])
+                    self.assertEquals(expected_output.format("success"), process_output)


can we assert that first we tried with systemd, and then without it?

other than that LGTM, thanks!

I'll add that, thanks.

Yup, same comment: check call_count for both log.info and Popen.

I think checking Popen's call count and arguments is strong enough; checking log.info's call count seems too dependent on the implementation of something that's not crucial to the scenario.
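
A rough sketch of that kind of assertion, using `unittest.mock`. The function `start_command` is a hypothetical stand-in for the code under test, and `fake_popen` simply makes any `systemd-run` call exit with 1; neither is taken from the agent.

```python
import subprocess
import sys
from unittest.mock import patch

original_popen = subprocess.Popen


def start_command(argv):
    """Hypothetical code under test: try systemd-run first, then a plain invocation."""
    process = subprocess.Popen(["systemd-run", "--scope"] + argv,
                               stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, _ = process.communicate()
    if process.returncode != 0:
        process = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        stdout, _ = process.communicate()
    return stdout.decode()


def fake_popen(argv, *args, **kwargs):
    """Replace any systemd-run call with a command that exits with 1."""
    if argv[0] == "systemd-run":
        failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
        return original_popen(failing, *args, **kwargs)
    return original_popen(argv, *args, **kwargs)


def check_fallback_order():
    """Return the Popen call count, the argv of both calls, and the final output."""
    with patch("subprocess.Popen", side_effect=fake_popen) as mock_popen:
        output = start_command([sys.executable, "-c", "print('success')"])
    first_argv = mock_popen.call_args_list[0][0][0]
    second_argv = mock_popen.call_args_list[1][0][0]
    return mock_popen.call_count, first_argv, second_argv, output
```

Asserting on the argv of each call pins down the order (systemd first, plain second) without depending on how the fallback is logged.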

narrieta · 2019-10-25T18:33:13Z

tests/common/test_cgroupapi.py

+                    self.assertEquals(expected_output.format("success"), process_output)
+
+    @patch('time.sleep', side_effect=lambda _: mock_sleep(0.001))
+    def test_start_extension_command_should_use_fallback_option_if_systemd_times_out_externally(self, _):


actually, what is the difference between the new test and this one? (test_start_extension_command_should_use_fallback_option_if_systemd_times_out_externally)

The _internally method mimics what happens when systemd times out itself, the _externally method mimics what happens when we time it out. I wanted to keep both since we're not in control of the timeout threshold of systemd and it might be larger or smaller than how we define our extension timeout.

Can you describe when we time it out? If we are timing out, then where are we validating it?

After inspecting logs where systemd reported "Connection timed out", it's clear that its internal timeout is significantly shorter than what we define for extension operations (~1 minute versus 5 minutes for extensions), so I'm going to drop the _externally test because:

it's currently broken (prompts for sudo but then succeeds even without sudo)

it doesn't test anything of value (we are not actually causing systemd to time out or hang before starting the extension operation, but mocking it).

The _internally test (which I will rename to just systemd_times_out) covers this scenario.

larohra · 2019-10-25T18:43:04Z

tests/common/test_cgroupapi.py

+        # extension operations. When systemd times out, it will write that to stderr and exit with exit code 1.
+        # In that case, we will internally recognize the failure due to the non-zero exit code, not as a timeout.
+        original_popen = subprocess.Popen
+        systemd_timeout_command = "echo 'Failed to start transient scope unit: Connection timed out' >&2 && exit 1"


Also, I was going through the systemd fallback code and noticed something: it doesn't matter whether it's a systemd timeout or a systemd execution failure (non-zero exit); the exception is handled the same way. If that's the case, is there any particular reason why the tests are named 'systemd_times_out'?

implementation vs behavior :)

though the implementation currently handles errors the same (whether they are time outs or not) we want to assert the behavior of important scenarios, timeout being one of them.

if the implementation changes, we want the behavior to remain the same.

Ohh nice, I get the motivation behind writing the test.
One follow-up though: if we're checking the behavior, shouldn't we be testing that behavior? For example, this test uses exit 1 to simulate the failure, which is not necessarily a systemd timeout. If we're validating a scenario, shouldn't the test actually mock a systemd timeout?

Yup, absolutely. When systemd times out it exits with 1, no stdout and a specific message in stderr. The mock replicates this behavior.

Ahh very interesting, thanks for sharing! This is an interesting systemd behavior I was unaware of! :)

One may argue that the mock should also sleep for the timeout period, but that would not fit with a unit test, I think (too slow). An integration test may be worth considering.
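
The mocked timeout described in this thread can be checked in isolation. The sketch below runs the same shell command quoted in the diff and reports what a caller would observe; it assumes a POSIX shell is available, and `run_fake_systemd_timeout` is a name invented for this example.

```python
import subprocess

# Same command string as in the test diff above.
SYSTEMD_TIMEOUT_COMMAND = (
    "echo 'Failed to start transient scope unit: Connection timed out' >&2 && exit 1"
)


def run_fake_systemd_timeout():
    """Run the mock command and return (exit code, stdout, stderr)."""
    result = subprocess.run(SYSTEMD_TIMEOUT_COMMAND, shell=True,
                            capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr
```

The caller sees exit code 1, empty stdout, and the message on stderr, which matches the observed behavior described above. The sketch does not reproduce the delay before systemd gives up.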

larohra

LGTM

narrieta · 2019-10-25T22:55:18Z

tests/common/test_cgroupapi.py

-                        self.assertEquals(extension_cgroups, [])
-                        self.assertEquals(expected_output.format("success"), process_output)
+                    # We expect two calls to Popen, first for the systemd-run call, second for the fallback option
+                    self.assertEquals(2, patch_mock_popen.call_count)


you can have a stronger check if you assert that the arguments to the first call are wrapped with systemd-run and the arguments for the second call are not.

which is done just 2 lines after my comment

oh, my

LGTM

Happy friday, Norberto! :D

add systemd internal timeout test and log when invoking fallback for extension invocation

8fe8056

pgombar requested review from larohra, narrieta and vrdmr as code owners October 25, 2019 18:16

larohra reviewed Oct 25, 2019

narrieta reviewed Oct 25, 2019

larohra reviewed Oct 25, 2019

larohra approved these changes Oct 25, 2019

make stronger assertions and remove unnecessary test

5b3117e

narrieta reviewed Oct 25, 2019

narrieta approved these changes Oct 25, 2019

pgombar merged commit ddc3f22 into Azure:develop Oct 25, 2019

pgombar deleted the improve_systemd_fallback branch October 25, 2019 23:39
