-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
check for unexpected process in agent cgroups before cgroups enabled #3103
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #3103 +/- ##
========================================
Coverage 71.87% 71.87%
========================================
Files 110 110
Lines 16406 16424 +18
Branches 2344 2348 +4
========================================
+ Hits 11791 11805 +14
- Misses 4064 4067 +3
- Partials 551 552 +1 ☔ View full report in Codecov by Sentry. |
@@ -146,45 +146,18 @@ def wait_for_log_message(message, timeout=datetime.timedelta(minutes=5)): | |||
fail("The agent did not find [{0}] in its log within the allowed timeout".format(message)) | |||
|
|||
|
|||
def verify_process_check_on_agent_cgroups(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed here since checking as part of new test
@@ -205,11 +178,20 @@ def verify_throttling_time_check_on_agent_cgroups(): | |||
fail("The agent did not disable its CPUQuota: {0}".format(get_agent_cpu_quota())) | |||
|
|||
|
|||
def cleanup_test_setup(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test has special setup, so cleaning up to avoid affecting other tests..
self._check_processes_in_agent_cgroup() | ||
except CGroupsException as exception: | ||
_log_cgroup_warning(ustr(exception)) | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels unintuitive that this function returns True when the check fails. Can we rename the function or switch the return value?
Note: System may kick the added process out of the cgroups, keeps adding until agent detect that process | ||
""" | ||
def unknown_process_found(): | ||
cgroup_procs_path = os.path.join(BASE_CGROUP, "cpu,cpuacct" + get_agent_cgroup_mount_path(), "cgroup.procs") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this path is specific to v1
def unknown_process_found(): | ||
cgroup_procs_path = os.path.join(BASE_CGROUP, "cpu,cpuacct" + get_agent_cgroup_mount_path(), "cgroup.procs") | ||
log.info("Adding dummy process %s to cgroup.procs file %s", pid, cgroup_procs_path) | ||
process = subprocess.Popen(f"echo {pid} > {cgroup_procs_path}", shell=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to check for errors writing to the file. you can use run_command or just python's open/write functions
|
||
def creating_dummy_process(): | ||
log.info("Creating dummy process to add to agent's cgroup") | ||
dd_command = ["dd", "if=/dev/zero", "of=/dev/null"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the intention of using dd? this will consume high CPU. sleep may be good enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no intention, I remembered to use this command, decided to use. I can change it
tests_e2e/tests/scripts/agent_cgroups_process_check-unknown_process_check.py
Show resolved
Hide resolved
|
||
The issue we observed that long running extension processes may be in agent cgroups if agent goes this cycle enabled(1)->disabled(2)->enabled(3). | ||
1. Agent cgroups enabled in some version | ||
2. Disabled agent cgroups due to check_cgroups regular check, now extension runs will be in agent cgroups. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand #2.
If cgroups are disabled because some unexpected process is found in agent cgroup, then why would extension runs now be in agent cgroups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once agent cgroups disabled (disable for both agent and extensions), we don't run the extensions in the extension slice. That means it will be in agent cgroups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please add those details to the comment
if agent_slice not in AZURE_SLICE: | ||
return False | ||
|
||
if self._agent_cpu_cgroup_path is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use this path for finding processes in cgroup.procs file. It could be none for v2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we use agent_cpu_cgroup_path to check which processes in agent cgroup? Why not memory_cgroup_path? or should we be checking both cpu and memory paths?
it's possible for memory to be mounted/in use but cpu not
tests/ga/test_cgroupconfigurator.py
Outdated
@@ -905,6 +909,33 @@ def test_check_cgroups_should_disable_cgroups_when_a_check_fails(self): | |||
for p in patchers: | |||
p.stop() | |||
|
|||
@patch('azurelinuxagent.ga.cgroupconfigurator.CGroupConfigurator._Impl._check_processes_in_agent_cgroup', side_effect=CGroupsException("Test")) | |||
@patch('azurelinuxagent.ga.cgroupconfigurator.add_event') | |||
def test_agent_not_enable_cgroups_if_unexpected_process_already_in_agent_cgroups(self, add_event, _): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_agent_not_enable_cgroups_if_unexpected_process_already_in_agent_cgroups(self, add_event, _): | |
def test_agent_should_not_enable_cgroups_if_unexpected_process_already_in_agent_cgroups(self, add_event, _): |
Description
This pr fixes the issue if unexpected process in the agent cgroup before enable cgroups, we don't continue with cgroups limit setup.
We have observed a scenario in production that
long running extension processes may be in agent cgroups if agent goes this cycle enabled(1)->disabled(2)->enabled(3).
1. Agent cgroups enabled in some version
2. Disabled agent cgroups due to check_cgroups regular check, now extension runs will be in agent cgroups.
3. When ext_hanlder restart and enable the cgroups again, already running processes from step 2 still be in agent cgroups. This may cause the extensions run with agent limit.
Issue #
PR information
Quality of Code and Contribution Guidelines