Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable cgroups when daemon is setup incorrectly #1688

Merged
merged 3 commits into from
Nov 4, 2019

Conversation

narrieta
Copy link
Member

@narrieta narrieta commented Oct 31, 2019

Disable cgroups when, under systemd, the daemon's PID has been added to the file system explicitly.

Also disable when we cannot move the PID to walinuxagent.service (when not under systemd)


This change is Reviewable

@codecov
Copy link

codecov bot commented Oct 31, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@6211315). Click here to learn what that means.
The diff coverage is 93.22%.

Impacted file tree graph

@@            Coverage Diff             @@
##             develop    #1688   +/-   ##
==========================================
  Coverage           ?   67.61%           
==========================================
  Files              ?       80           
  Lines              ?    11477           
  Branches           ?     1608           
==========================================
  Hits               ?     7760           
  Misses             ?     3375           
  Partials           ?      342
Impacted Files Coverage Δ
azurelinuxagent/daemon/main.py 72.91% <ø> (ø)
azurelinuxagent/common/cgroupstelemetry.py 94.69% <ø> (ø)
azurelinuxagent/common/exception.py 94.56% <100%> (ø)
azurelinuxagent/ga/update.py 88.42% <100%> (ø)
azurelinuxagent/common/cgroupconfigurator.py 90.1% <88.88%> (ø)
azurelinuxagent/common/cgroupapi.py 84.7% <94.59%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6211315...50a0633. Read the comment docs.

msg = 'Error in cgroup controller "{0}": {1}. {2}'.format(controller, ustr(e), message)
logger.warn(msg)
errors.append(msg)
return errors
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now there is no need for this function to report errors to the caller

from tests.tools import *


class CGroupsApiTestCase(AgentTestCase):
def setUp(self):
class _MockedFileSystemTestCase(AgentTestCase):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored common mocks to this class

root_slice_name = SystemdCgroupsApi()._get_extensions_slice_root_name()
self.assertEqual(root_slice_name, "system-walinuxagent.extensions.slice")

def test_it_should_return_extension_slice_name(self):
def test_get_extension_slice_name_should_return_the_slice_for_the_given_extension(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

replaced 'it' with the actual test subject in these method names

Copy link
Contributor

@pgombar pgombar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with one more general question: should we explicitly check if the daemon's (and extension handler's) PID is in the correct place, and, if not, also disable cgroups? Right now we're just checking if the PID is in the wrong place, not if it's in the right place, if that makes sense.

So, if the daemon and/or exthandler's PID is not in walinuxagent.service, then traverse the /proc/ filesystem to find where it is, log it, and disable cgroups. This way, we might identify more wrong places where our PIDs might have wandered off to.

operation(controller, daemon_pid)
finally:
for _, cgroup in legacy_cgroups:
logger.info('Removing {0}', cgroup)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Suggestion to make the message Removing legacy cgroup {0} instead to make it even clearer what's getting removed when just looking at the logs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the whole path is in the message. a few lines up in the log there would be a message identifying the path as a legacy group (logged at line 132)

@@ -79,7 +79,7 @@ def enable(self):
def disable(self):
self._enabled = False

def _invoke_cgroup_operation(self, operation, error_message):
def _invoke_cgroup_operation(self, operation, error_message, on_error=None):
"""
Ensures the given operation is invoked only if cgroups are enabled and traps any errors on the operation.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the comment explaining what on_error does and why it's there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't the name imply a callback to execute when an error occurs? :)

@@ -69,7 +69,6 @@ def run(self, child_args=None):
self.initialize_environment()

CGroupConfigurator.get_instance().create_agent_cgroups(track_cgroups=False)
CGroupConfigurator.get_instance().cleanup_old_cgroups()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have the daemon clean up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not needed. Or did you have an scenario in mind when it was added?

from azurelinuxagent.common.future import ustr
from azurelinuxagent.common.utils import shellutil
from nose.plugins.attrib import attr
from tests.utils.cgroups_tools import CGroupsTools
from tests.tools import *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor import *?

Copy link
Member Author

@narrieta narrieta Nov 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked into it when I started modifying this test but backed off. There is some funny stuff there that is going to require more time than I have in the current release.

Let's see who in the team gets there first.

os.mkdir(os.path.join(self.cgroups_file_system_root, "cpu"))
os.mkdir(os.path.join(self.cgroups_file_system_root, "memory"))
(message_format, controller, error, message) = args
self.assertEquals(message_format, 'Error in cgroup controller "{0}": {1}. {2}')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this will abstract away whatever's in {}? Cool, did not know this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was actually needed because I changed the statement that logs this message. Previously it was logging a string formatted with string.format, now it lets the logger do the formatting so the first argument is actually the formatting string.

from azurelinuxagent.common.cgroupconfigurator import CGroupConfigurator
from azurelinuxagent.common.cgroupstelemetry import CGroupsTelemetry
from azurelinuxagent.common.exception import CGroupsException
from azurelinuxagent.common.osutil.default import DefaultOSUtil
from tests.utils.cgroups_tools import CGroupsTools
from tests.tools import *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another import *

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i know :)

logger.info("Writing daemon's PID ({0}) to {1}", daemon_pid, new_path)
fileutil.append_file(os.path.join(new_path, "cgroup.procs"), daemon_pid)
msg = "Moved daemon's PID from legacy cgroup to {0}".format(new_path)
add_event(AGENT_NAME, version=CURRENT_VERSION, op=WALAEventOperation.CGroupsCleanUp, is_success=True, message=msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we explicitly say that the collection of resource usage data is still enabled? I'm looking at it from the perspective of querying for this event and parsing what the different event messages mean.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can talk offline... not sure I understand how that would improve parsing

@@ -55,7 +55,11 @@ class CGroupsException(AgentError):

def __init__(self, msg, inner=None):
super(AgentError, self).__init__(msg, inner)
# TODO: AgentError should set the message - investigate whether doing it there would break anything
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... would it? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dunno; no time to check :)

Doing it in AgentError would change the formatting of all exceptions. Need to spend some time checking what would happen then.

@narrieta
Copy link
Member Author

narrieta commented Nov 1, 2019

@pgombar

should we explicitly check if the daemon's (and extension handler's) PID is in the correct place, and, if not, also disable cgroups?

We could add more checks on a later release

Copy link
Member

@vrdmr vrdmr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments.

azurelinuxagent/common/cgroupapi.py Show resolved Hide resolved
azurelinuxagent/ga/update.py Show resolved Hide resolved
@narrieta narrieta merged commit 68176db into Azure:develop Nov 4, 2019
@narrieta narrieta deleted the systemd-check branch December 3, 2019 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants