-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor cgroup controllers #3135
refactor cgroup controllers #3135
Conversation
* Refactor Cgroup, CpuCgroup, MemoryCgroup to ControllerMetrics, CpuMetrics, MemoryMetrics * Create methods to get unit/process cgroup representation * Refactoring changes * Refactoring changes * Fix e2e test * Fix unintentional comment change * Remove unneeded comments * Clean up comments and make code more readable * Simplify get controller metrics * Clean up cgroupapi * Cleanup cgroup -> controllermetrics changes * Clean up cgroup configurator * Fix unit tests for agent.py * Fix cgroupapi tests * Fix cgroupconfigurator and tests * Rename controller metrics tests * Ignore pylint issues * Improve test coverage for cgroupapi * Rename cgroup to metrics * Update cgroup.procs to accurately represent file
@@ -279,11 +259,6 @@ def _is_systemd_failure(scope_name, stderr): | |||
unit_not_found = "Unit {0} not found.".format(scope_name) | |||
return unit_not_found in stderr or scope_name not in stderr | |||
|
|||
@staticmethod | |||
def get_processes_in_cgroup(cgroup_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to Cgroup class
@@ -202,7 +202,6 @@ class _SystemdCgroupApi(object): | |||
Cgroup interface via systemd. Contains common api implementations between cgroup v1 and v2. | |||
""" | |||
def __init__(self): | |||
self._agent_unit_name = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was an unused property
|
||
return True | ||
|
||
def get_controller_metrics(self, expected_relative_path=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next step for cgroup v2 support will be adding logic to track metrics for cpu and memory in v2
try: | ||
daemon = os.getppid() | ||
extension_handler = os.getpid() | ||
agent_commands = set() | ||
agent_commands.update(shellutil.get_running_commands()) | ||
systemd_run_commands = set() | ||
systemd_run_commands.update(self._cgroups_api.get_systemd_run_commands()) | ||
agent_cgroup = self._cgroups_api.get_processes_in_cgroup(cgroup_path) | ||
agent_cgroup_proccesses = self._agent_cgroup.get_processes() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, before we were getting processes in only CPU cgroup.procs (or only Memory cgroups.procs if cpu wasn't mounted). Now this gets processes under cpu and memory if both mounted.
@@ -780,59 +725,37 @@ def _get_parent(pid): | |||
return 0 | |||
|
|||
def start_tracking_unit_cgroups(self, unit_name): | |||
""" | |||
TODO: Start tracking Memory Cgroups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this comment because memory was already being tracked.
|
||
except Exception as exception: | ||
log_cgroup_info("Failed to start tracking resource usage for the extension: {0}".format(ustr(exception)), send_event=False) | ||
|
||
def stop_tracking_unit_cgroups(self, unit_name): | ||
""" | ||
TODO: remove Memory cgroups from tracked list. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this comment because memory was already being tracked.
|
||
except Exception as exception: | ||
log_cgroup_info("Failed to stop tracking resource usage for the extension service: {0}".format(ustr(exception)), send_event=False) | ||
|
||
def stop_tracking_extension_cgroups(self, extension_name): | ||
""" | ||
TODO: remove extension Memory cgroups from tracked list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this comment because memory was already being tracked.
for metric in metrics: | ||
for prop in metric.get_unit_properties(): | ||
log_cgroup_info('{0}: {1}'.format(prop, systemd.get_unit_property(systemd.get_agent_unit_name(), prop))) | ||
CGroupsTelemetry.track_cgroup(metric) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we were tracking cgroup before agent quota set, so we may get wrong data. It should match how did before
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gotcha, I'll move this below set_cpu_quota
|
||
return in_expected_slice | ||
|
||
def get_controller_metrics(self, expected_relative_path=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when do we send expected_path? and also not case. Cgroup instance created from relative paths, so controller_paths should have those paths, so why we need expected_path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use expected_path when we only want to track the metrics if the process cgroup is at the expected path.
For example, in cgroupconfigurator.initialize() we only want to track the agent metrics if the cgroup is at this path: 'azure.slice/waagent.service'. If it's at any other path, we don't want to track metrics and we disable cgroups.
See lines 202-205 in cgroupconfigurator for usage
azurelinuxagent/ga/cgroupapi.py
Outdated
|
||
if expected_relative_path is not None: | ||
expected_path = "" | ||
controller_mountpoint = self._controller_mountpoints[controller] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may fail if controller not found in mountpoint. Consider switch to get and handle None case and log the warning if it;s none
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning here was that if the controller exists in self._controller_paths, then it must be mounted. But I will add none check here to be cautious
def unknown_process_found(cpu_cgroup): | ||
cgroup_procs_path = os.path.join(cpu_cgroup, "cgroup.procs") | ||
def unknown_process_found(cgroup_path): | ||
cgroup_procs_path = os.path.join(cgroup_path, "cgroup.procs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we get cgroup.procs directly from systemd api in agent, instead of building path here?
unit_controlgroup_path = systemd.get_unit_property(unit_name, "ControlGroup") | ||
|
||
if isinstance(cgroups_api, SystemdCgroupApiv1): | ||
return os.path.join(cgroups_api.get_controller_mountpoints().get('cpu,cpuacct'), unit_controlgroup_path[1:]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can abstract this in cgroup class?
why can't use cgroups_api. get_unit_cgroup() and get cgroup object and add method to get cgroup cpu or memory path based on function parameters or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could but it doesn't really make sense for v2 since all controllers are at the same path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I think method we have here for constructing cgroup.procs path, at least we should change that
cpu_cgroup, _ = get_unit_cgroup_paths(AGENT_SERVICE_NAME) | ||
|
||
found: bool = retry_if_false(lambda: unknown_process_found(cpu_cgroup), attempts=3) | ||
found: bool = retry_if_false(lambda: unknown_process_found(), attempts=3) # pylint: disable=W0108 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you check if lambda needed if function don't have args?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the first arg to retry_if_false needs to be Callable. it won't accept bool as first arg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll check if i can remove () and lambda
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
@@ -169,10 +169,16 @@ def get_tracked_metrics(self, **_): | |||
""" | |||
raise NotImplementedError() | |||
|
|||
def get_unit_properties(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite get the purpose of the unit properties. Seems like currently we are just logging them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah they are only being used to log the properties when we initialize cgroups. We log the properties before setting any quotas. Then after setting cpu quota, we log only the cpuquota property value.
It was logging we were doing before so I kept it, but since we log the properties after setting quotas it might not be necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe remove them altogether? the log message doesn't have any context about what those are
or add more context to the message, if you think this could be useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll open a PR to improve the message. I think it may be useful to include, so that we can see from the logs if there were any quotas previously set before the current agent attempts to set quotas
Description
The current code for cgroups is overcomplicated because it keeps track of each controller path separately. If we want to track additional controllers in the future with the existing code, we would need to update each block of code that references controller paths. In cgroup v2, there is a unified cgroup path, which makes it unnecessary to track different paths for each controller.
The changes in this PR simplify the code by creating an abstraction for a cgroup which handles any operations that require cgroup paths. The changes include renaming the CGroup/CpuCGroup/MemoryCGroup classes to ControllerMetrics/CpuMetrics/MemoryMetrics and introduces the Cgroup/CgroupV1/CgroupV2 classes in the cgroupapi to represent a cgroup. The cgroupapi was updated to return instances of a Cgroup instead of tuples of controller paths.
Note this PR does not yet add logic to track cpu/memory metrics in v2. Those changes will be included in a future PR.
Test run of all cgroup scenarios: https://dev.azure.com/cplatruntime/WALinuxAgent/_build/results?buildId=9290&view=results
Issue #
PR information
Quality of Code and Contribution Guidelines