-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable cgroups when daemon is setup incorrectly #1688
Conversation
Codecov Report
@@ Coverage Diff @@
## develop #1688 +/- ##
==========================================
Coverage ? 67.61%
==========================================
Files ? 80
Lines ? 11477
Branches ? 1608
==========================================
Hits ? 7760
Misses ? 3375
Partials ? 342
Continue to review full report at Codecov.
|
msg = 'Error in cgroup controller "{0}": {1}. {2}'.format(controller, ustr(e), message) | ||
logger.warn(msg) | ||
errors.append(msg) | ||
return errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now there is no need for this function to report errors to the caller
from tests.tools import * | ||
|
||
|
||
class CGroupsApiTestCase(AgentTestCase): | ||
def setUp(self): | ||
class _MockedFileSystemTestCase(AgentTestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactored common mocks to this class
root_slice_name = SystemdCgroupsApi()._get_extensions_slice_root_name() | ||
self.assertEqual(root_slice_name, "system-walinuxagent.extensions.slice") | ||
|
||
def test_it_should_return_extension_slice_name(self): | ||
def test_get_extension_slice_name_should_return_the_slice_for_the_given_extension(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
replaced 'it' with the actual test subject in these method names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with one more general question: should we explicitly check if the daemon's (and extension handler's) PID is in the correct place, and, if not, also disable cgroups? Right now we're just checking if the PID is in the wrong place, not if it's in the right place, if that makes sense.
So, if the daemon and/or exthandler's PID is not in walinuxagent.service
, then traverse the /proc/
filesystem to find where it is, log it, and disable cgroups. This way, we might identify more wrong places where our PIDs might have wandered off to.
operation(controller, daemon_pid) | ||
finally: | ||
for _, cgroup in legacy_cgroups: | ||
logger.info('Removing {0}', cgroup) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Suggestion to make the message Removing legacy cgroup {0}
instead to make it even clearer what's getting removed when just looking at the logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the whole path is in the message. a few lines up in the log there would be a message identifying the path as a legacy group (logged at line 132)
@@ -79,7 +79,7 @@ def enable(self): | |||
def disable(self): | |||
self._enabled = False | |||
|
|||
def _invoke_cgroup_operation(self, operation, error_message): | |||
def _invoke_cgroup_operation(self, operation, error_message, on_error=None): | |||
""" | |||
Ensures the given operation is invoked only if cgroups are enabled and traps any errors on the operation. | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you update the comment explaining what on_error
does and why it's there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doesn't the name imply a callback to execute when an error occurs? :)
@@ -69,7 +69,6 @@ def run(self, child_args=None): | |||
self.initialize_environment() | |||
|
|||
CGroupConfigurator.get_instance().create_agent_cgroups(track_cgroups=False) | |||
CGroupConfigurator.get_instance().cleanup_old_cgroups() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not have the daemon clean up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not needed. Or did you have an scenario in mind when it was added?
from azurelinuxagent.common.future import ustr | ||
from azurelinuxagent.common.utils import shellutil | ||
from nose.plugins.attrib import attr | ||
from tests.utils.cgroups_tools import CGroupsTools | ||
from tests.tools import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactor import *
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looked into it when I started modifying this test but backed off. There is some funny stuff there that is going to require more time than I have in the current release.
Let's see who in the team gets there first.
os.mkdir(os.path.join(self.cgroups_file_system_root, "cpu")) | ||
os.mkdir(os.path.join(self.cgroups_file_system_root, "memory")) | ||
(message_format, controller, error, message) = args | ||
self.assertEquals(message_format, 'Error in cgroup controller "{0}": {1}. {2}') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this will abstract away whatever's in {}
? Cool, did not know this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was actually needed because I changed the statement that logs this message. Previously it was logging a string formatted with string.format, now it lets the logger do the formatting so the first argument is actually the formatting string.
from azurelinuxagent.common.cgroupconfigurator import CGroupConfigurator | ||
from azurelinuxagent.common.cgroupstelemetry import CGroupsTelemetry | ||
from azurelinuxagent.common.exception import CGroupsException | ||
from azurelinuxagent.common.osutil.default import DefaultOSUtil | ||
from tests.utils.cgroups_tools import CGroupsTools | ||
from tests.tools import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another import *
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i know :)
logger.info("Writing daemon's PID ({0}) to {1}", daemon_pid, new_path) | ||
fileutil.append_file(os.path.join(new_path, "cgroup.procs"), daemon_pid) | ||
msg = "Moved daemon's PID from legacy cgroup to {0}".format(new_path) | ||
add_event(AGENT_NAME, version=CURRENT_VERSION, op=WALAEventOperation.CGroupsCleanUp, is_success=True, message=msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we explicitly say that the collection of resource usage data is still enabled? I'm looking at it from the perspective of querying for this event and parsing what the different event messages mean.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can talk offline... not sure I understand how that would improve parsing
@@ -55,7 +55,11 @@ class CGroupsException(AgentError): | |||
|
|||
def __init__(self, msg, inner=None): | |||
super(AgentError, self).__init__(msg, inner) | |||
# TODO: AgentError should set the message - investigate whether doing it there would break anything |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... would it? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dunno; no time to check :)
Doing it in AgentError would change the formatting of all exceptions. Need to spend some time checking what would happen then.
We could add more checks on a later release |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comments.
Disable cgroups when, under systemd, the daemon's PID has been added to the file system explicitly.
Also disable when we cannot move the PID to walinuxagent.service (when not under systemd)
This change is