Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor cgroup controllers #3135

Merged
merged 9 commits into from
Jun 16, 2024

Conversation

maddieford
Copy link
Contributor

@maddieford maddieford commented May 31, 2024

Description

The current code for cgroups is overcomplicated because it keeps track of each controller path separately. If we want to track additional controllers in the future with the existing code, we would need to update each block of code that references controller paths. In cgroup v2, there is a unified cgroup path, which makes it unnecessary to track different paths for each controller. 

The changes in this PR simplify the code by creating an abstraction for a cgroup which handles any operations that require cgroup paths. The changes include renaming the CGroup/CpuCGroup/MemoryCGroup classes to ControllerMetrics/CpuMetrics/MemoryMetrics and introduces the Cgroup/CgroupV1/CgroupV2 classes in the cgroupapi to represent a cgroup. The cgroupapi was updated to return instances of a Cgroup instead of tuples of controller paths.

Note this PR does not yet add logic to track cpu/memory metrics in v2. Those changes will be included in a future PR.

Test run of all cgroup scenarios: https://dev.azure.com/cplatruntime/WALinuxAgent/_build/results?buildId=9290&view=results

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

* Refactor Cgroup, CpuCgroup, MemoryCgroup to ControllerMetrics, CpuMetrics, MemoryMetrics

* Create methods to get unit/process cgroup representation

* Refactoring changes

* Refactoring changes

* Fix e2e test

* Fix unintentional comment change

* Remove unneeded comments

* Clean up comments and make code more readable

* Simplify get controller metrics

* Clean up cgroupapi

* Cleanup cgroup -> controllermetrics changes

* Clean up cgroup configurator

* Fix unit tests for agent.py

* Fix cgroupapi tests

* Fix cgroupconfigurator and tests

* Rename controller metrics tests

* Ignore pylint issues

* Improve test coverage for cgroupapi

* Rename cgroup to metrics

* Update cgroup.procs to accurately represent file
@@ -279,11 +259,6 @@ def _is_systemd_failure(scope_name, stderr):
unit_not_found = "Unit {0} not found.".format(scope_name)
return unit_not_found in stderr or scope_name not in stderr

@staticmethod
def get_processes_in_cgroup(cgroup_path):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to Cgroup class

@@ -202,7 +202,6 @@ class _SystemdCgroupApi(object):
Cgroup interface via systemd. Contains common api implementations between cgroup v1 and v2.
"""
def __init__(self):
self._agent_unit_name = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was an unused property


return True

def get_controller_metrics(self, expected_relative_path=None):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next step for cgroup v2 support will be adding logic to track metrics for cpu and memory in v2

try:
daemon = os.getppid()
extension_handler = os.getpid()
agent_commands = set()
agent_commands.update(shellutil.get_running_commands())
systemd_run_commands = set()
systemd_run_commands.update(self._cgroups_api.get_systemd_run_commands())
agent_cgroup = self._cgroups_api.get_processes_in_cgroup(cgroup_path)
agent_cgroup_proccesses = self._agent_cgroup.get_processes()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, before we were getting processes in only CPU cgroup.procs (or only Memory cgroups.procs if cpu wasn't mounted). Now this gets processes under cpu and memory if both mounted.

@@ -780,59 +725,37 @@ def _get_parent(pid):
return 0

def start_tracking_unit_cgroups(self, unit_name):
"""
TODO: Start tracking Memory Cgroups
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this comment because memory was already being tracked.


except Exception as exception:
log_cgroup_info("Failed to start tracking resource usage for the extension: {0}".format(ustr(exception)), send_event=False)

def stop_tracking_unit_cgroups(self, unit_name):
"""
TODO: remove Memory cgroups from tracked list.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this comment because memory was already being tracked.


except Exception as exception:
log_cgroup_info("Failed to stop tracking resource usage for the extension service: {0}".format(ustr(exception)), send_event=False)

def stop_tracking_extension_cgroups(self, extension_name):
"""
TODO: remove extension Memory cgroups from tracked list
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this comment because memory was already being tracked.

for metric in metrics:
for prop in metric.get_unit_properties():
log_cgroup_info('{0}: {1}'.format(prop, systemd.get_unit_property(systemd.get_agent_unit_name(), prop)))
CGroupsTelemetry.track_cgroup(metric)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now we were tracking cgroup before agent quota set, so we may get wrong data. It should match how did before

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gotcha, I'll move this below set_cpu_quota


return in_expected_slice

def get_controller_metrics(self, expected_relative_path=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when do we send expected_path? and also not case. Cgroup instance created from relative paths, so controller_paths should have those paths, so why we need expected_path?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use expected_path when we only want to track the metrics if the process cgroup is at the expected path.

For example, in cgroupconfigurator.initialize() we only want to track the agent metrics if the cgroup is at this path: 'azure.slice/waagent.service'. If it's at any other path, we don't want to track metrics and we disable cgroups.
See lines 202-205 in cgroupconfigurator for usage


if expected_relative_path is not None:
expected_path = ""
controller_mountpoint = self._controller_mountpoints[controller]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may fail if controller not found in mountpoint. Consider switch to get and handle None case and log the warning if it;s none

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning here was that if the controller exists in self._controller_paths, then it must be mounted. But I will add none check here to be cautious

def unknown_process_found(cpu_cgroup):
cgroup_procs_path = os.path.join(cpu_cgroup, "cgroup.procs")
def unknown_process_found(cgroup_path):
cgroup_procs_path = os.path.join(cgroup_path, "cgroup.procs")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we get cgroup.procs directly from systemd api in agent, instead of building path here?

unit_controlgroup_path = systemd.get_unit_property(unit_name, "ControlGroup")

if isinstance(cgroups_api, SystemdCgroupApiv1):
return os.path.join(cgroups_api.get_controller_mountpoints().get('cpu,cpuacct'), unit_controlgroup_path[1:])
Copy link
Contributor

@nagworld9 nagworld9 Jun 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can abstract this in cgroup class?
why can't use cgroups_api. get_unit_cgroup() and get cgroup object and add method to get cgroup cpu or memory path based on function parameters or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could but it doesn't really make sense for v2 since all controllers are at the same path

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think method we have here for constructing cgroup.procs path, at least we should change that

maddieford and others added 2 commits June 14, 2024 11:25
* Address Nag's comments

* pyling
cpu_cgroup, _ = get_unit_cgroup_paths(AGENT_SERVICE_NAME)

found: bool = retry_if_false(lambda: unknown_process_found(cpu_cgroup), attempts=3)
found: bool = retry_if_false(lambda: unknown_process_found(), attempts=3) # pylint: disable=W0108
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you check if lambda needed if function don't have args?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the first arg to retry_if_false needs to be Callable. it won't accept bool as first arg

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll check if i can remove () and lambda

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@maddieford maddieford merged commit 610e12b into Azure:develop Jun 16, 2024
10 of 11 checks passed
@maddieford maddieford deleted the refactor_cgroup_controllers branch June 16, 2024 18:42
@@ -169,10 +169,16 @@ def get_tracked_metrics(self, **_):
"""
raise NotImplementedError()

def get_unit_properties(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite get the purpose of the unit properties. Seems like currently we are just logging them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah they are only being used to log the properties when we initialize cgroups. We log the properties before setting any quotas. Then after setting cpu quota, we log only the cpuquota property value.

It was logging we were doing before so I kept it, but since we log the properties after setting quotas it might not be necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe remove them altogether? the log message doesn't have any context about what those are

or add more context to the message, if you think this could be useful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll open a PR to improve the message. I think it may be useful to include, so that we can see from the logs if there were any quotas previously set before the current agent attempts to set quotas

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants