Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

check for unexpected process in agent cgroups before cgroups enabled #3103

Merged
merged 6 commits into from
Apr 4, 2024

Conversation

nagworld9
Copy link
Contributor

Description

This pr fixes the issue if unexpected process in the agent cgroup before enable cgroups, we don't continue with cgroups limit setup.

We have observed a scenario in production that
long running extension processes may be in agent cgroups if agent goes this cycle enabled(1)->disabled(2)->enabled(3).
1. Agent cgroups enabled in some version
2. Disabled agent cgroups due to check_cgroups regular check, now extension runs will be in agent cgroups.
3. When ext_hanlder restart and enable the cgroups again, already running processes from step 2 still be in agent cgroups. This may cause the extensions run with agent limit.

Issue #


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@codecov-commenter
Copy link

codecov-commenter commented Apr 1, 2024

Codecov Report

Attention: Patch coverage is 43.75000% with 18 lines in your changes are missing coverage. Please review.

Project coverage is 71.87%. Comparing base (3f49315) to head (a1dbe21).

❗ Current head a1dbe21 differs from pull request most recent head fd13729. Consider uploading reports for the commit fd13729 to get more accurate results

Files Patch % Lines
azurelinuxagent/ga/cgroupconfigurator.py 43.75% 17 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop    #3103   +/-   ##
========================================
  Coverage    71.87%   71.87%           
========================================
  Files          110      110           
  Lines        16406    16424   +18     
  Branches      2344     2348    +4     
========================================
+ Hits         11791    11805   +14     
- Misses        4064     4067    +3     
- Partials       551      552    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -146,45 +146,18 @@ def wait_for_log_message(message, timeout=datetime.timedelta(minutes=5)):
fail("The agent did not find [{0}] in its log within the allowed timeout".format(message))


def verify_process_check_on_agent_cgroups():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed here since checking as part of new test

@@ -205,11 +178,20 @@ def verify_throttling_time_check_on_agent_cgroups():
fail("The agent did not disable its CPUQuota: {0}".format(get_agent_cpu_quota()))


def cleanup_test_setup():
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test has special setup, so cleaning up to avoid affecting other tests..

azurelinuxagent/ga/cgroupconfigurator.py Outdated Show resolved Hide resolved
self._check_processes_in_agent_cgroup()
except CGroupsException as exception:
_log_cgroup_warning(ustr(exception))
return True
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels unintuitive that this function returns True when the check fails. Can we rename the function or switch the return value?

tests_e2e/tests/lib/cgroup_helpers.py Show resolved Hide resolved
Note: System may kick the added process out of the cgroups, keeps adding until agent detect that process
"""
def unknown_process_found():
cgroup_procs_path = os.path.join(BASE_CGROUP, "cpu,cpuacct" + get_agent_cgroup_mount_path(), "cgroup.procs")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this path is specific to v1

def unknown_process_found():
cgroup_procs_path = os.path.join(BASE_CGROUP, "cpu,cpuacct" + get_agent_cgroup_mount_path(), "cgroup.procs")
log.info("Adding dummy process %s to cgroup.procs file %s", pid, cgroup_procs_path)
process = subprocess.Popen(f"echo {pid} > {cgroup_procs_path}", shell=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check for errors writing to the file. you can use run_command or just python's open/write functions


def creating_dummy_process():
log.info("Creating dummy process to add to agent's cgroup")
dd_command = ["dd", "if=/dev/zero", "of=/dev/null"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the intention of using dd? this will consume high CPU. sleep may be good enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no intention, I remembered to use this command, decided to use. I can change it


The issue we observed that long running extension processes may be in agent cgroups if agent goes this cycle enabled(1)->disabled(2)->enabled(3).
1. Agent cgroups enabled in some version
2. Disabled agent cgroups due to check_cgroups regular check, now extension runs will be in agent cgroups.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand #2.
If cgroups are disabled because some unexpected process is found in agent cgroup, then why would extension runs now be in agent cgroups?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once agent cgroups disabled (disable for both agent and extensions), we don't run the extensions in the extension slice. That means it will be in agent cgroups.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please add those details to the comment

if agent_slice not in AZURE_SLICE:
return False

if self._agent_cpu_cgroup_path is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this check here?

Copy link
Contributor Author

@nagworld9 nagworld9 Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use this path for finding processes in cgroup.procs file. It could be none for v2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we use agent_cpu_cgroup_path to check which processes in agent cgroup? Why not memory_cgroup_path? or should we be checking both cpu and memory paths?

it's possible for memory to be mounted/in use but cpu not

@@ -905,6 +909,33 @@ def test_check_cgroups_should_disable_cgroups_when_a_check_fails(self):
for p in patchers:
p.stop()

@patch('azurelinuxagent.ga.cgroupconfigurator.CGroupConfigurator._Impl._check_processes_in_agent_cgroup', side_effect=CGroupsException("Test"))
@patch('azurelinuxagent.ga.cgroupconfigurator.add_event')
def test_agent_not_enable_cgroups_if_unexpected_process_already_in_agent_cgroups(self, add_event, _):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def test_agent_not_enable_cgroups_if_unexpected_process_already_in_agent_cgroups(self, add_event, _):
def test_agent_should_not_enable_cgroups_if_unexpected_process_already_in_agent_cgroups(self, add_event, _):

@nagworld9 nagworld9 merged commit 782a165 into Azure:develop Apr 4, 2024
11 checks passed
@nagworld9 nagworld9 deleted the agentcgroup-check branch April 4, 2024 17:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants