Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Periodic Log Collection in systemd distros #2295

Merged
merged 20 commits into from
Jul 20, 2021

Conversation

kevinclark19a
Copy link
Contributor

Description

Set the default configuration to enable periodic log collection.


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

@codecov
Copy link

codecov bot commented Jul 6, 2021

Codecov Report

Merging #2295 (44c8f47) into develop (f6699cd) will decrease coverage by 0.03%.
The diff coverage is 78.94%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2295      +/-   ##
===========================================
- Coverage    70.71%   70.68%   -0.04%     
===========================================
  Files           96       96              
  Lines        13952    13940      -12     
  Branches      2001     1998       -3     
===========================================
- Hits          9866     9853      -13     
- Misses        3644     3647       +3     
+ Partials       442      440       -2     
Impacted Files Coverage Δ
azurelinuxagent/ga/collect_logs.py 82.26% <62.50%> (-1.41%) ⬇️
azurelinuxagent/common/cgroupconfigurator.py 70.50% <87.50%> (+0.09%) ⬆️
azurelinuxagent/agent.py 57.43% <100.00%> (-0.77%) ⬇️
azurelinuxagent/common/event.py 85.22% <0.00%> (-0.69%) ⬇️
azurelinuxagent/ga/collect_telemetry_events.py 90.06% <0.00%> (+0.34%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6699cd...44c8f47. Read the comment docs.

@@ -169,18 +169,12 @@ def _collect_logs(self):

systemd_cmd = [
"systemd-run", "--unit={0}".format(logcollector.CGROUPS_UNIT),
"--slice={0}".format(cgroupconfigurator.AZURE_SLICE)
"--slice={0}".format(cgroupconfigurator.LOGCOLLECTOR_SLICE), "--scope"
Copy link
Contributor

@nagworld9 nagworld9 Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, are we managing logcollector under azure.slice/azure-walinuxagent-periodic_logcollector.slice/collect-logs.scope If so, what I read is systemd-run creates the scope under the system slice by default as per our implementation for extension slice
https://github.com/Azure/WALinuxAgent/blob/develop/azurelinuxagent/common/cgroupapi.py#L258.
May be either defining slice name in contents or something like creating file under azure.slice should do the trick?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per our offline sync, systemd-run is unfortunately a bit wild with how it generates the hierarchical slices: we're passing azure-walinuxagent-periodic_logcollector.slice as the slice name here, and systemd-run translates that to azure.slice/azure-walinuxagent.slice/azure-walinuxagent-periodic_logcollector.slice by using the -s in the original as delimiters.

cpu_slice_matches, cpu_unit_matches = validate_cgroup_path(cpu_cgroup_path)
memory_slice_matches, memory_unit_matches = validate_cgroup_path(memory_cgroup_path)
cpu_slice_matches, cpu_unit_matches = validate_cgroup_path(cpu_cgroup_path,
cgroupconfigurator.LOGCOLLECTOR_SLICE, None)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why None here?

Copy link
Member

@narrieta narrieta Jul 8, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to @nagworld9's question: why is there an asymmetry between the cpu & memory paths? (i.e. None vs CGROUPS_UNIT) thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nagworld9, this is a reflection of systemd v245's behavior on my ubuntu 20.04 dev VM; the path within the process cgroup file is azure.slice/azure-walinuxagent.slice/azure-walinuxagent-periodic_logcollector.slice (note the missing unit name, collect-logs.scope). It isn't universal though, unfortunately, as I don't believe this asymmetry exists on ubuntu 18.04 (off the top of my head, systemd v237).

@nagworld9 & @narrieta: I need to do more research here as I'm not quite sure if this is just a configuration error, but barring that, we might need to abstract this check into osutil or something.

if path is None:
return False, False

regex_match = re.match(r'^(?P<slice>.*)/(?P<unit>.*)$', path)
# '(.*/)?': process slice may be nested in other, heirarchical slices.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small typo: "heirarchical"

azurelinuxagent/agent.py Outdated Show resolved Hide resolved
cpu_slice_matches, cpu_unit_matches = validate_cgroup_path(cpu_cgroup_path,
cgroupconfigurator.LOGCOLLECTOR_SLICE, None)
memory_slice_matches, memory_unit_matches = validate_cgroup_path(memory_cgroup_path,
cgroupconfigurator.LOGCOLLECTOR_SLICE, logcollector.CGROUPS_UNIT)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why using print? is this leftover debug code or should those messages go to the agent's log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prints get routed, undecorated, to the log file when invoked by the periodic task and (obviously) printed out to stdout on manual invocation. (so the second)

@@ -290,6 +305,12 @@ def __setup_azure_slice():
if not os.path.exists(vmextensions_slice):
files_to_create.append((vmextensions_slice, _VMEXTENSIONS_SLICE_CONTENTS))

if not os.path.exists(logcollector_slice):
from azurelinuxagent.ga.collect_logs import CollectLogsHandler
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move this import to the top of the file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was worried about a cyclical import (as we import cgroupconfigurator into ga.collect_logs); I haven't actually tried it yet, though, so I can do so and update here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: this indeed does cause a cyclical import, so I'm inclined to leave it as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but coming to think about it, the CGroupConfigurator should not have a dependency on the CollectLogsHandler.

Seems like you just need the value of a few constants? Maybe move those to this module, or move them to a module that both CGroupConfigurator and CollectLogsHandler depend on?


# '(.*/)?': process slice may be nested in other, hierarchical slices.
# '[^\s./]*': process slice can't contain spaces, periods, or slashes.
slice_regex = r'(.*/)?(?P<slice>[^\s./]*.slice)'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need such a generic regex for this check. Would it be enough to check that collector is within cgroupconfigurator.LOGCOLLECTOR_SLICE?


slice_group, unit_group = regex_match.group("slice", "unit")

slice_matches = (slice_group == cgroupconfigurator.AZURE_SLICE)
unit_matches = (unit_group == logcollector.CGROUPS_UNIT)
if unit_group != expected_unit:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just check against cgroupconfigurator.LOGCOLLECTOR_SLICE I think this check can also be omitted

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe I am misunderstanding the reason for these 2 checks? The first check silently returns False, while this emits telemetry.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the existing code for this function is more complicated because the collector was not on its own slice. Seems to me that most of this function can be replaced with a simple check for cgroupconfigurator.LOGCOLLECTOR_SLICE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the process fails the first check, it should exit with INVALID_CGROUP_ERROR. The idea behind this check was to get telemetry for when we run into the abnormal /proc/self/cgroup content.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the first check doing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

never mind, misread the code... this is a helper function...

but i still think this is the case:

Note that the existing code for this function is more complicated because the collector was not on its own slice. Seems to me that most of this function can be replaced with a simple check for cgroupconfigurator.LOGCOLLECTOR_SLICE

narrieta
narrieta previously approved these changes Jul 19, 2021
@kevinclark19a kevinclark19a merged commit 6492ebd into Azure:develop Jul 20, 2021
@kevinclark19a kevinclark19a deleted the enable_log-collector branch December 13, 2021 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants