-
Notifications
You must be signed in to change notification settings - Fork 372
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable Periodic Log Collection in systemd distros #2295
Enable Periodic Log Collection in systemd distros #2295
Conversation
a03b983
to
8fd744c
Compare
Codecov Report
@@ Coverage Diff @@
## develop #2295 +/- ##
===========================================
- Coverage 70.71% 70.68% -0.04%
===========================================
Files 96 96
Lines 13952 13940 -12
Branches 2001 1998 -3
===========================================
- Hits 9866 9853 -13
- Misses 3644 3647 +3
+ Partials 442 440 -2
Continue to review full report at Codecov.
|
…uxAgent into enable_log-collector
@@ -169,18 +169,12 @@ def _collect_logs(self): | |||
|
|||
systemd_cmd = [ | |||
"systemd-run", "--unit={0}".format(logcollector.CGROUPS_UNIT), | |||
"--slice={0}".format(cgroupconfigurator.AZURE_SLICE) | |||
"--slice={0}".format(cgroupconfigurator.LOGCOLLECTOR_SLICE), "--scope" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure, are we managing logcollector under azure.slice/azure-walinuxagent-periodic_logcollector.slice/collect-logs.scope
If so, what I read is systemd-run creates the scope under the system slice by default as per our implementation for extension slice
https://github.com/Azure/WALinuxAgent/blob/develop/azurelinuxagent/common/cgroupapi.py#L258.
May be either defining slice name in contents or something like creating file under azure.slice should do the trick?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per our offline sync, systemd-run
is unfortunately a bit wild with how it generates the hierarchical slices: we're passing azure-walinuxagent-periodic_logcollector.slice
as the slice name here, and systemd-run
translates that to azure.slice/azure-walinuxagent.slice/azure-walinuxagent-periodic_logcollector.slice
by using the -
s in the original as delimiters.
azurelinuxagent/agent.py
Outdated
cpu_slice_matches, cpu_unit_matches = validate_cgroup_path(cpu_cgroup_path) | ||
memory_slice_matches, memory_unit_matches = validate_cgroup_path(memory_cgroup_path) | ||
cpu_slice_matches, cpu_unit_matches = validate_cgroup_path(cpu_cgroup_path, | ||
cgroupconfigurator.LOGCOLLECTOR_SLICE, None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason why None
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to @nagworld9's question: why is there an asymmetry between the cpu & memory paths? (i.e. None vs CGROUPS_UNIT) thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nagworld9, this is a reflection of systemd v245
's behavior on my ubuntu 20.04 dev VM; the path within the process cgroup file is azure.slice/azure-walinuxagent.slice/azure-walinuxagent-periodic_logcollector.slice
(note the missing unit name, collect-logs.scope
). It isn't universal though, unfortunately, as I don't believe this asymmetry exists on ubuntu 18.04 (off the top of my head, systemd v237
).
@nagworld9 & @narrieta: I need to do more research here as I'm not quite sure if this is just a configuration error, but barring that, we might need to abstract this check into osutil
or something.
azurelinuxagent/agent.py
Outdated
if path is None: | ||
return False, False | ||
|
||
regex_match = re.match(r'^(?P<slice>.*)/(?P<unit>.*)$', path) | ||
# '(.*/)?': process slice may be nested in other, heirarchical slices. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small typo: "heirarchical"
azurelinuxagent/agent.py
Outdated
cpu_slice_matches, cpu_unit_matches = validate_cgroup_path(cpu_cgroup_path, | ||
cgroupconfigurator.LOGCOLLECTOR_SLICE, None) | ||
memory_slice_matches, memory_unit_matches = validate_cgroup_path(memory_cgroup_path, | ||
cgroupconfigurator.LOGCOLLECTOR_SLICE, logcollector.CGROUPS_UNIT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why using print? is this leftover debug code or should those messages go to the agent's log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print
s get routed, undecorated, to the log file when invoked by the periodic task and (obviously) printed out to stdout
on manual invocation. (so the second)
@@ -290,6 +305,12 @@ def __setup_azure_slice(): | |||
if not os.path.exists(vmextensions_slice): | |||
files_to_create.append((vmextensions_slice, _VMEXTENSIONS_SLICE_CONTENTS)) | |||
|
|||
if not os.path.exists(logcollector_slice): | |||
from azurelinuxagent.ga.collect_logs import CollectLogsHandler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we move this import to the top of the file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was worried about a cyclical import (as we import cgroupconfigurator
into ga.collect_logs
); I haven't actually tried it yet, though, so I can do so and update here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: this indeed does cause a cyclical import, so I'm inclined to leave it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, but coming to think about it, the CGroupConfigurator should not have a dependency on the CollectLogsHandler.
Seems like you just need the value of a few constants? Maybe move those to this module, or move them to a module that both CGroupConfigurator and CollectLogsHandler depend on?
azurelinuxagent/agent.py
Outdated
|
||
# '(.*/)?': process slice may be nested in other, hierarchical slices. | ||
# '[^\s./]*': process slice can't contain spaces, periods, or slashes. | ||
slice_regex = r'(.*/)?(?P<slice>[^\s./]*.slice)' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need such a generic regex for this check. Would it be enough to check that collector is within cgroupconfigurator.LOGCOLLECTOR_SLICE?
azurelinuxagent/agent.py
Outdated
|
||
slice_group, unit_group = regex_match.group("slice", "unit") | ||
|
||
slice_matches = (slice_group == cgroupconfigurator.AZURE_SLICE) | ||
unit_matches = (unit_group == logcollector.CGROUPS_UNIT) | ||
if unit_group != expected_unit: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we just check against cgroupconfigurator.LOGCOLLECTOR_SLICE I think this check can also be omitted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe I am misunderstanding the reason for these 2 checks? The first check silently returns False, while this emits telemetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the existing code for this function is more complicated because the collector was not on its own slice. Seems to me that most of this function can be replaced with a simple check for cgroupconfigurator.LOGCOLLECTOR_SLICE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the process fails the first check, it should exit with INVALID_CGROUP_ERROR. The idea behind this check was to get telemetry for when we run into the abnormal /proc/self/cgroup
content.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the first check doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
never mind, misread the code... this is a helper function...
but i still think this is the case:
Note that the existing code for this function is more complicated because the collector was not on its own slice. Seems to me that most of this function can be replaced with a simple check for cgroupconfigurator.LOGCOLLECTOR_SLICE
Description
Set the default configuration to enable periodic log collection.
PR information
Quality of Code and Contribution Guidelines