-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify that the CPU and Memory cgroups for the agent are properly initialized; disabled cgroups if they are not active. #2015
Conversation
@@ -76,12 +79,7 @@ def log_cgroup_info(format_string, *args): | |||
logger.info(message) | |||
add_event(op=WALAEventOperation.CGroupsInfo, message=message) | |||
|
|||
def log_cgroup_warn(format_string, *args): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
converted warnings into infos
else: | ||
cpu_accounting = self._cgroups_api.get_unit_property(agent_unit_name, "CPUAccounting") # pylint: disable=E1101 | ||
if cpu_cgroup_relative_path != expected_relative_path: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not enable cgroups if agent is not in system.slice/walinuxagent.service
CGroupsTelemetry.track_cgroup(MemoryCgroup(agent_unit_name, self._agent_memory_cgroup_path)) | ||
|
||
log_cgroup_info("Agent cgroups: CPU: {0} -- MEMORY: {1}", self._agent_cpu_cgroup_path, self._agent_memory_cgroup_path) | ||
if self._agent_cpu_cgroup_path is not None or self._agent_memory_cgroup_path is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not enable cgroups if both cpu and memory are None
PeriodicOperation("send_host_plugin_heartbeat", self.send_host_plugin_heartbeat, self.HOST_PLUGIN_HEARTBEAT_PERIOD), | ||
PeriodicOperation("send_imds_heartbeat", self.send_imds_heartbeat, self.IMDS_HEARTBEAT_PERIOD), | ||
ReportNetworkConfigurationChangesOperation(), | ||
] | ||
if CGroupConfigurator.get_instance().enabled(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't check for cgroups stuff if they are disabled
@@ -267,6 +268,24 @@ def run(self, debug=False): # pylint: disable=R0912,R0914 | |||
logger.info(os_info_msg) | |||
add_event(AGENT_NAME, op=WALAEventOperation.OSInfo, message=os_info_msg) | |||
|
|||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initialization of these threads was racing against initialization of the main thread
# | ||
# Perform initialization tasks | ||
# | ||
from azurelinuxagent.ga.exthandlers import get_exthandlers_handler, migrate_handler_state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this import be up top of the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya; i tried to fix that and it ends up breaking up the tests... we have several of those problematic imports across the code base... yet another cleanup task to do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. What were the errors the tests were having? Should we document this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can try moving it in the current code and check the errors... i don't think we need to document it, all of us are aware of this (now you are :))
if on_error is not None: | ||
try: | ||
on_error(e) | ||
except Exception as ex: # pylint: disable=W0612 | ||
logger.warn("CGroupConfigurator._invoke_cgroup_operation: {0}".format(ustr(e))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW on this one the linter caught me logging the wrong exception
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 minor comment, else LGTM
else: | ||
cpu_accounting = self._cgroups_api.get_unit_property(agent_unit_name, "CPUAccounting") # pylint: disable=E1101 | ||
if cpu_cgroup_relative_path != expected_relative_path: | ||
log_cgroup_info("The Agent is not in the expected cgroup; will not enable cgroup monitoring. CPU relative path:[{0}] Expected:[{1}]", cpu_cgroup_relative_path, expected_relative_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be a warning here?
any(cg for cg in tracked if cg.name == 'walinuxagent.service' and 'memory' in cg.path), | ||
"The Agent's memory is not being tracked") | ||
# protected-access<W0212> Disabled: OK to access CGroupConfigurator._tracked from unit test for CGroupConfigurator | ||
tracked = CGroupsTelemetry._tracked # pylint: disable=protected-access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this protected access ok here? It would be nice to keep unit tests agnostic of the internals of our prod modules.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, in an ideal situation. I believe in this situation it is OK to break the rule; the unit tests do not make heavy use of internals, and this is unit test for A using A's internals. Things get dicey in other parts of the code where we have unit tests for A using internals of D when A uses B uses C uses D.
Exposing the internals of CGroupsTelemetry just the test's benefit doesn't seem right, and looking more more complicated way to do this check doesn't seem to be worth it.
On a small number of machines the controllers for CPU and/or memory were not active and the agent's cgroups were initialized to None. The resource usage data was not being collected, as expected, but the check for processes in the agent's cgroup was still being done, generating false positives about cgroups not working properly.
Also, added a check to disable cgroups if both cgroups are None.
On another set of machines, the agent was within the init.scope, or directly under services.slice; added a check to disable cgroups if the agent is not in services.slice/walinuxagent.service.
I also converted some WARNING messages to INFO, since there were a couple of questions when people noticed them in the agent log.
Lastly, addressed linter warnings in the code I touched.