Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify that the CPU and Memory cgroups for the agent are properly initialized; disabled cgroups if they are not active. #2015

Merged
merged 5 commits into from
Sep 25, 2020

Conversation

narrieta
Copy link
Member

On a small number of machines the controllers for CPU and/or memory were not active and the agent's cgroups were initialized to None. The resource usage data was not being collected, as expected, but the check for processes in the agent's cgroup was still being done, generating false positives about cgroups not working properly.

Also, added a check to disable cgroups if both cgroups are None.

On another set of machines, the agent was within the init.scope, or directly under services.slice; added a check to disable cgroups if the agent is not in services.slice/walinuxagent.service.

I also converted some WARNING messages to INFO, since there were a couple of questions when people noticed them in the agent log.

Lastly, addressed linter warnings in the code I touched.

@@ -76,12 +79,7 @@ def log_cgroup_info(format_string, *args):
logger.info(message)
add_event(op=WALAEventOperation.CGroupsInfo, message=message)

def log_cgroup_warn(format_string, *args):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

converted warnings into infos

else:
cpu_accounting = self._cgroups_api.get_unit_property(agent_unit_name, "CPUAccounting") # pylint: disable=E1101
if cpu_cgroup_relative_path != expected_relative_path:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not enable cgroups if agent is not in system.slice/walinuxagent.service

CGroupsTelemetry.track_cgroup(MemoryCgroup(agent_unit_name, self._agent_memory_cgroup_path))

log_cgroup_info("Agent cgroups: CPU: {0} -- MEMORY: {1}", self._agent_cpu_cgroup_path, self._agent_memory_cgroup_path)
if self._agent_cpu_cgroup_path is not None or self._agent_memory_cgroup_path is not None:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not enable cgroups if both cpu and memory are None

PeriodicOperation("send_host_plugin_heartbeat", self.send_host_plugin_heartbeat, self.HOST_PLUGIN_HEARTBEAT_PERIOD),
PeriodicOperation("send_imds_heartbeat", self.send_imds_heartbeat, self.IMDS_HEARTBEAT_PERIOD),
ReportNetworkConfigurationChangesOperation(),
]
if CGroupConfigurator.get_instance().enabled():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't check for cgroups stuff if they are disabled

@@ -267,6 +268,24 @@ def run(self, debug=False): # pylint: disable=R0912,R0914
logger.info(os_info_msg)
add_event(AGENT_NAME, op=WALAEventOperation.OSInfo, message=os_info_msg)

#
Copy link
Member Author

@narrieta narrieta Sep 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initialization of these threads was racing against initialization of the main thread

#
# Perform initialization tasks
#
from azurelinuxagent.ga.exthandlers import get_exthandlers_handler, migrate_handler_state
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this import be up top of the file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ya; i tried to fix that and it ends up breaking up the tests... we have several of those problematic imports across the code base... yet another cleanup task to do

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm. What were the errors the tests were having? Should we document this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can try moving it in the current code and check the errors... i don't think we need to document it, all of us are aware of this (now you are :))

@narrieta narrieta closed this Sep 23, 2020
@narrieta narrieta deleted the cgroups branch September 23, 2020 18:17
@narrieta narrieta restored the cgroups branch September 23, 2020 18:17
@narrieta narrieta reopened this Sep 23, 2020
if on_error is not None:
try:
on_error(e)
except Exception as ex: # pylint: disable=W0612
logger.warn("CGroupConfigurator._invoke_cgroup_operation: {0}".format(ustr(e)))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW on this one the linter caught me logging the wrong exception

Copy link
Contributor

@larohra larohra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 minor comment, else LGTM

else:
cpu_accounting = self._cgroups_api.get_unit_property(agent_unit_name, "CPUAccounting") # pylint: disable=E1101
if cpu_cgroup_relative_path != expected_relative_path:
log_cgroup_info("The Agent is not in the expected cgroup; will not enable cgroup monitoring. CPU relative path:[{0}] Expected:[{1}]", cpu_cgroup_relative_path, expected_relative_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it be a warning here?

any(cg for cg in tracked if cg.name == 'walinuxagent.service' and 'memory' in cg.path),
"The Agent's memory is not being tracked")
# protected-access<W0212> Disabled: OK to access CGroupConfigurator._tracked from unit test for CGroupConfigurator
tracked = CGroupsTelemetry._tracked # pylint: disable=protected-access
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this protected access ok here? It would be nice to keep unit tests agnostic of the internals of our prod modules.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in an ideal situation. I believe in this situation it is OK to break the rule; the unit tests do not make heavy use of internals, and this is unit test for A using A's internals. Things get dicey in other parts of the code where we have unit tests for A using internals of D when A uses B uses C uses D.
Exposing the internals of CGroupsTelemetry just the test's benefit doesn't seem right, and looking more more complicated way to do this check doesn't seem to be worth it.

@narrieta narrieta merged commit 287eeb7 into Azure:develop Sep 25, 2020
@narrieta narrieta deleted the cgroups branch September 25, 2020 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants