Merge release 2.5.0.0 into master #2448

Merged
merged 50 commits into from
Dec 20, 2021
50 commits
4445be2
update test-requirements to pin pylint. (#2288)
kevinclark19a Jun 30, 2021
72c6e1a
Allow systemd-run in the Agent's cgroup (#2287)
narrieta Jun 30, 2021
94890a0
onboard ubuntu20 (#2279)
nagworld9 Jul 1, 2021
5ab5a83
Move Github Actions VMs to Ubuntu 18 (#2291)
kevinclark19a Jul 1, 2021
b268011
Add debug info for systemd-run false positives (#2292)
narrieta Jul 1, 2021
89c1a80
Add query for vmSettings (#2293)
narrieta Jul 2, 2021
e8a5a5e
Kill logcollector process if it is in the wrong cgroup (#2289)
kevinclark19a Jul 2, 2021
b8afae7
Added support for vmSettings' ETag (#2294)
narrieta Jul 6, 2021
1e03ab5
onboard redhat82 (#2290)
nagworld9 Jul 6, 2021
6bbb326
Remove trailing spaces from command name (#2296)
narrieta Jul 7, 2021
fe1f088
Save waagent_status to history folder and add additional details to t…
dhivyaganesan Jul 9, 2021
b160c23
Adding the new file to log collector (#2301)
dhivyaganesan Jul 13, 2021
c727403
Merge from develop
Jul 14, 2021
f6e384e
Merge pull request #2303 from Azure/hotfix-2.3.1
narrieta Jul 14, 2021
f040e09
Use If-None-Match for ETag header (#2304)
narrieta Jul 14, 2021
b082b1c
Merge remote-tracking branch 'upstream/develop' into fast-track
Jul 14, 2021
fcd0b23
Log correlation ID for errors in vmSettings requests (#2306)
narrieta Jul 14, 2021
f6699cd
Helper to handle exception message (#2305)
dhivyaganesan Jul 19, 2021
6492ebd
Enable Periodic Log Collection in ubuntu systemd distros (#2295)
kevinclark19a Jul 20, 2021
10a1be9
Query vmSettings only on new Goal State (#2313)
narrieta Jul 22, 2021
563e695
Merge branch 'develop' into fast-track
narrieta Jul 23, 2021
6d77254
Merge conflicts
Jul 23, 2021
cb847c3
Remove references to traceback
Jul 23, 2021
d087ba6
Merge pull request #2314 from Azure/fast-track
narrieta Jul 23, 2021
05eef39
Remove reference to re.IGNORECASE (#2316)
narrieta Jul 27, 2021
127defa
add and remove extension slice (#2315)
nagworld9 Jul 29, 2021
5cde43a
Added log statements to debug issues in vmSettings API (#2317)
narrieta Aug 2, 2021
582d89a
Handle HTTP GONE in vmSettings request (#2321)
narrieta Aug 6, 2021
265f46e
Rename Debug.FetchVmSettings to Debug.EnableFastTrack (#2324)
narrieta Aug 11, 2021
669105d
Update HostGAplugin headers before fetching vmSettings (#2323)
narrieta Aug 11, 2021
9227454
Do not create placeholder status file for AKS extensions (#2298) (#2328)
larohra Aug 13, 2021
4ea51f0
Dont create default status file for Single-Config extensions (#2318) …
larohra Aug 13, 2021
5be8d90
Refactor write_ext_handlers_status_to_info_file function (#2325)
dhivyaganesan Aug 16, 2021
ee13187
Report transitioning when status file not found (#2330)
larohra Aug 16, 2021
ccf8902
mock systemctl stop cmd (#2335)
nagworld9 Aug 20, 2021
bbbead6
Added message to test_run_latest's assertion (#2334)
narrieta Aug 23, 2021
bc8a373
Implement InitialGoalStatePeriod parameter + improvements in logging …
narrieta Aug 23, 2021
f02164a
enable ETP (out of preview) (#2337)
kevinclark19a Aug 23, 2021
5f0dee5
Fix operation name in InitializeHostPlugin event (#2338)
narrieta Aug 24, 2021
9e73d01
Save the original subprocess.Popen before mocking it (#2340)
narrieta Aug 25, 2021
167e494
added AlmaLinux (#2219)
JohnKepplers Aug 27, 2021
66cc4dd
Add back the ETP preview flag in HandlerEvironment.json and populate …
kevinclark19a Aug 30, 2021
141c660
Getting the agent version from the version file (#2346)
dhivyaganesan Sep 1, 2021
8bd66bf
Update Version (#2348)
dhivyaganesan Sep 1, 2021
0eeed5a
Fix bug with dependent extensions with no settings (#2285) (#2349)
nagworld9 Sep 2, 2021
81f7a1d
Use handler status if extension status is None when computing the Ext…
narrieta Sep 20, 2021
53629f9
Release preparation 2.5.0.1 (#2360)
dhivyaganesan Sep 20, 2021
a805142
Define ExtensionsSummary.__eq__ (#2371)
narrieta Sep 30, 2021
4488a62
Release preparation 2.5.0.2 (#2372)
dhivyaganesan Sep 30, 2021
40f750c
Merge branch 'master' into release-2.5.0.0
Dec 20, 2021
2 changes: 1 addition & 1 deletion .github/workflows/ci_pr.yml
@@ -11,7 +11,7 @@ jobs:
test-legacy-python-versions:

name: "Python 2.6 Unit Tests"
runs-on: ubuntu-16.04
runs-on: ubuntu-18.04

strategy:
fail-fast: false
26 changes: 22 additions & 4 deletions azurelinuxagent/agent.py
@@ -28,22 +28,24 @@
import subprocess
import sys
import threading
import traceback
from azurelinuxagent.common import cgroupconfigurator, logcollector
from azurelinuxagent.common.cgroupapi import SystemdCgroupsApi

import azurelinuxagent.common.conf as conf
import azurelinuxagent.common.event as event
import azurelinuxagent.common.logger as logger
from azurelinuxagent.common.future import ustr
from azurelinuxagent.common.logcollector import LogCollector, OUTPUT_RESULTS_FILE_PATH
from azurelinuxagent.common.osutil import get_osutil
from azurelinuxagent.common.utils import fileutil
from azurelinuxagent.common.utils import fileutil, textutil
from azurelinuxagent.common.utils.flexible_version import FlexibleVersion
from azurelinuxagent.common.utils.networkutil import AddFirewallRules
from azurelinuxagent.common.version import AGENT_NAME, AGENT_LONG_VERSION, AGENT_VERSION, \
DISTRO_NAME, DISTRO_VERSION, \
PY_VERSION_MAJOR, PY_VERSION_MINOR, \
PY_VERSION_MICRO, GOAL_STATE_AGENT_VERSION, \
get_daemon_version, set_daemon_version
from azurelinuxagent.ga.collect_logs import CollectLogsHandler
from azurelinuxagent.pa.provision.default import ProvisionHandler


@@ -199,6 +201,22 @@ def collect_logs(self, is_full_mode):
else:
print("Running log collector mode normal")

# Check the cgroups unit
if CollectLogsHandler.should_validate_cgroups():
cpu_cgroup_path, memory_cgroup_path = SystemdCgroupsApi.get_process_cgroup_relative_paths("self")

cpu_slice_matches = (cgroupconfigurator.LOGCOLLECTOR_SLICE in cpu_cgroup_path)
memory_slice_matches = (cgroupconfigurator.LOGCOLLECTOR_SLICE in memory_cgroup_path)

if not cpu_slice_matches or not memory_slice_matches:
print("The Log Collector process is not in the proper cgroups:")
if not cpu_slice_matches:
print("\tunexpected cpu slice")
if not memory_slice_matches:
print("\tunexpected memory slice")

sys.exit(logcollector.INVALID_CGROUPS_ERRCODE)

try:
log_collector = LogCollector(is_full_mode)
archive = log_collector.collect_logs_and_get_archive()
@@ -259,10 +277,10 @@ def main(args=None):
agent.collect_logs(log_collector_full_mode)
elif command == AgentCommands.SetupFirewall:
agent.setup_firewall(firewall_metadata)
except Exception:
except Exception as e:
logger.error(u"Failed to run '{0}': {1}",
command,
traceback.format_exc())
textutil.format_exception(e))


def parse_args(sys_args):
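The cgroup check added to `collect_logs` reduces to a substring test on the cpu and memory cgroup paths of the current process. A minimal sketch of that idea follows, assuming cgroup v1 lines of the form `hierarchy-id:controllers:path`. `parse_cgroup_paths` and `in_expected_slice` are hypothetical helpers written for illustration, not the agent's `SystemdCgroupsApi`, and the sample paths are made up.

```python
LOGCOLLECTOR_SLICE = "azure-walinuxagent-logcollector.slice"  # value taken from the diff


def parse_cgroup_paths(proc_cgroup_text):
    """Map each cgroup v1 controller name to its path (hypothetical helper)."""
    paths = {}
    for line in proc_cgroup_text.splitlines():
        parts = line.strip().split(":", 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _, controllers, path = parts
        for controller in controllers.split(","):
            if controller:
                paths[controller] = path
    return paths


def in_expected_slice(proc_cgroup_text, slice_name=LOGCOLLECTOR_SLICE):
    """Return (cpu_ok, memory_ok) for the given /proc/<pid>/cgroup contents."""
    paths = parse_cgroup_paths(proc_cgroup_text)
    cpu_ok = slice_name in paths.get("cpu", "")
    memory_ok = slice_name in paths.get("memory", "")
    return cpu_ok, memory_ok


# Illustrative input only: memory is in the log collector slice, cpu is not.
SAMPLE = (
    "4:memory:/azure.slice/azure-walinuxagent.slice/"
    "azure-walinuxagent-logcollector.slice/collect-logs.scope\n"
    "3:cpu,cpuacct:/system.slice/walinuxagent.service\n"
)
```

With `SAMPLE`, only the memory controller matches, which corresponds to the "unexpected cpu slice" branch in the diff.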
23 changes: 12 additions & 11 deletions azurelinuxagent/common/cgroupapi.py
@@ -36,7 +36,7 @@

CGROUPS_FILE_SYSTEM_ROOT = '/sys/fs/cgroup'
CGROUP_CONTROLLERS = ["cpu", "memory"]

EXTENSION_SLICE_PREFIX = "azure-vmextensions"

class SystemdRunError(CGroupsException):
"""
@@ -66,11 +66,6 @@ def track_cgroups(extension_cgroups):
logger.warn("Cannot add cgroup '{0}' to tracking list; resource usage will not be tracked. "
"Error: {1}".format(cgroup.path, ustr(exception)))

@staticmethod
def _get_extension_cgroup_name(extension_name):
# Since '-' is used as a separator in systemd unit names, we replace it with '_' to prevent side-effects.
return extension_name.replace('-', '_')

@staticmethod
def get_processes_in_cgroup(cgroup_path):
with open(os.path.join(cgroup_path, "cgroup.procs"), "r") as cgroup_procs:
@@ -234,27 +229,33 @@ def _is_systemd_failure(scope_name, stderr):
unit_not_found = "Unit {0} not found.".format(scope_name)
return unit_not_found in stderr or scope_name not in stderr

def start_extension_command(self, extension_name, command, timeout, shell, cwd, env, stdout, stderr, error_code=ExtensionErrorCodes.PluginUnknownFailure):
scope = "{0}_{1}".format(self._get_extension_cgroup_name(extension_name), uuid.uuid4())
@staticmethod
def get_extension_cgroup_name(extension_name):
# Since '-' is used as a separator in systemd unit names, we replace it with '_' to prevent side-effects.
return EXTENSION_SLICE_PREFIX + "-" + extension_name.replace('-', '_')

def start_extension_command(self, extension_name, command, cmd_name, timeout, shell, cwd, env, stdout, stderr, error_code=ExtensionErrorCodes.PluginUnknownFailure):
scope = "{0}_{1}".format(cmd_name, uuid.uuid4())
extension_slice_name = self.get_extension_cgroup_name(extension_name)
with self._systemd_run_commands_lock:
process = subprocess.Popen( # pylint: disable=W1509
"systemd-run --unit={0} --scope --slice=azure-vmextensions.slice {1}".format(scope, command),
"systemd-run --unit={0} --scope --slice={1}.slice {2}".format(scope, extension_slice_name, command),
shell=shell,
cwd=cwd,
stdout=stdout,
stderr=stderr,
env=env,
preexec_fn=os.setsid)

# We start systemd-run with shell == True so process.pid is the shell's pid, not the pid for systemd-run
self._systemd_run_commands.append(process.pid)

scope_name = scope + '.scope'

logger.info("Started extension in unit '{0}'", scope_name)

try:
# systemd-run creates the scope under the system slice by default
cgroup_relative_path = os.path.join('azure.slice/azure-vmextensions.slice', scope_name)
cgroup_relative_path = os.path.join('azure.slice/azure-vmextensions.slice', extension_slice_name + ".slice")

cpu_cgroup_mountpoint, _ = self.get_cgroup_mount_points()
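This hunk gives each extension its own slice under `azure-vmextensions.slice`. systemd derives a slice's position in the hierarchy from the dashes in its name, which is why `-` in the extension name is replaced with `_`. A rough sketch of the naming and of the resulting cgroup path, using hypothetical helper names (`extension_slice_name`, `slice_cgroup_path`, `systemd_run_command`) that are not part of the agent:

```python
import uuid

EXTENSION_SLICE_PREFIX = "azure-vmextensions"  # value taken from the diff


def extension_slice_name(extension_name):
    """Slice unit name for an extension; '-' in the extension name becomes '_'."""
    return "{0}-{1}.slice".format(EXTENSION_SLICE_PREFIX, extension_name.replace("-", "_"))


def slice_cgroup_path(slice_name):
    """Expand 'a-b-c.slice' into 'a.slice/a-b.slice/a-b-c.slice' (systemd's nesting rule)."""
    stem = slice_name[:-len(".slice")]
    parts = stem.split("-")
    return "/".join("-".join(parts[:i]) + ".slice" for i in range(1, len(parts) + 1))


def systemd_run_command(extension_name, cmd_name, command):
    """Build the systemd-run command line and the scope unit name (illustrative only)."""
    scope = "{0}_{1}".format(cmd_name, uuid.uuid4())
    line = "systemd-run --unit={0} --scope --slice={1} {2}".format(
        scope, extension_slice_name(extension_name), command)
    return line, scope + ".scope"
```

For an extension named `Microsoft.Azure.Ext-Test` (a made-up name), the slice is `azure-vmextensions-Microsoft.Azure.Ext_Test.slice`, nested under `azure.slice/azure-vmextensions.slice`. Without the `_` substitution the extra dash would introduce another level in the hierarchy.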

129 changes: 106 additions & 23 deletions azurelinuxagent/common/cgroupconfigurator.py
@@ -22,7 +22,7 @@
from azurelinuxagent.common import conf
from azurelinuxagent.common import logger
from azurelinuxagent.common.cgroup import CpuCgroup, AGENT_NAME_TELEMETRY, MetricsCounter
from azurelinuxagent.common.cgroupapi import CGroupsApi, SystemdCgroupsApi, SystemdRunError
from azurelinuxagent.common.cgroupapi import CGroupsApi, SystemdCgroupsApi, SystemdRunError, EXTENSION_SLICE_PREFIX
from azurelinuxagent.common.cgroupstelemetry import CGroupsTelemetry
from azurelinuxagent.common.exception import ExtensionErrorCodes, CGroupsException
from azurelinuxagent.common.future import ustr
@@ -32,14 +32,14 @@
from azurelinuxagent.common.utils.extensionprocessutil import handle_process_completion
from azurelinuxagent.common.event import add_event, WALAEventOperation

_AZURE_SLICE = "azure.slice"
AZURE_SLICE = "azure.slice"
_AZURE_SLICE_CONTENTS = """
[Unit]
Description=Slice for Azure VM Agent and Extensions
DefaultDependencies=no
Before=slices.target
"""
_VMEXTENSIONS_SLICE = "azure-vmextensions.slice"
_VMEXTENSIONS_SLICE = EXTENSION_SLICE_PREFIX + ".slice"
_VMEXTENSIONS_SLICE_CONTENTS = """
[Unit]
Description=Slice for Azure VM Extensions
@@ -48,6 +48,31 @@
[Slice]
CPUAccounting=yes
"""
_EXTENSION_SLICE_CONTENTS = """
[Unit]
Description=Slice for Azure VM extension {extension_name}
DefaultDependencies=no
Before=slices.target
[Slice]
CPUAccounting=yes
"""
LOGCOLLECTOR_SLICE = "azure-walinuxagent-logcollector.slice"
# More info on resource limits properties in systemd here:
# https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/resource_management_guide/sec-modifying_control_groups
_LOGCOLLECTOR_SLICE_CONTENTS_FMT = """
[Unit]
Description=Slice for Azure VM Agent Periodic Log Collector
DefaultDependencies=no
Before=slices.target
[Slice]
CPUAccounting=yes
CPUQuota={cpu_quota}
MemoryAccounting=yes
MemoryLimit={memory_limit}
"""
_LOGCOLLECTOR_CPU_QUOTA = "5%"
_LOGCOLLECTOR_MEMORY_LIMIT = "30M" # K for kb, M for mb

_AGENT_DROP_IN_FILE_SLICE = "10-Slice.conf"
_AGENT_DROP_IN_FILE_SLICE_CONTENTS = """
# This drop-in unit file was created by the Azure VM Agent.
@@ -129,7 +154,7 @@ def initialize(self):

agent_unit_name = systemd.get_agent_unit_name()
agent_slice = systemd.get_unit_property(agent_unit_name, "Slice")
if agent_slice not in (_AZURE_SLICE, "system.slice"):
if agent_slice not in (AZURE_SLICE, "system.slice"):
_log_cgroup_warning("The agent is within an unexpected slice: {0}", agent_slice)
return

@@ -275,8 +300,9 @@ def __setup_azure_slice():
CGroupConfigurator._Impl.__cleanup_unit_file("/etc/systemd/system/system-walinuxagent.extensions.slice")

unit_file_install_path = systemd.get_unit_file_install_path()
azure_slice = os.path.join(unit_file_install_path, _AZURE_SLICE)
azure_slice = os.path.join(unit_file_install_path, AZURE_SLICE)
vmextensions_slice = os.path.join(unit_file_install_path, _VMEXTENSIONS_SLICE)
logcollector_slice = os.path.join(unit_file_install_path, LOGCOLLECTOR_SLICE)
agent_unit_file = systemd.get_agent_unit_file()
agent_drop_in_path = systemd.get_agent_drop_in_path()
agent_drop_in_file_slice = os.path.join(agent_drop_in_path, _AGENT_DROP_IN_FILE_SLICE)
@@ -290,6 +316,12 @@ def __setup_azure_slice():
if not os.path.exists(vmextensions_slice):
files_to_create.append((vmextensions_slice, _VMEXTENSIONS_SLICE_CONTENTS))

if not os.path.exists(logcollector_slice):
slice_contents = _LOGCOLLECTOR_SLICE_CONTENTS_FMT.format(cpu_quota=_LOGCOLLECTOR_CPU_QUOTA,
memory_limit=_LOGCOLLECTOR_MEMORY_LIMIT)

files_to_create.append((logcollector_slice, slice_contents))

if fileutil.findre_in_file(agent_unit_file, r"Slice=") is not None:
CGroupConfigurator._Impl.__cleanup_unit_file(agent_drop_in_file_slice)
else:
@@ -500,39 +532,50 @@ def _check_processes_in_agent_cgroup(self):
systemd_run_commands.update(self._cgroups_api.get_systemd_run_commands())

for process in agent_cgroup:
# Note that the agent uses systemd-run to start extensions; systemd-run belongs to the agent cgroup, though the extensions don't
# Note that the agent uses systemd-run to start extensions; systemd-run belongs to the agent cgroup, though the extensions don't.
if process in (daemon, extension_handler) or process in systemd_run_commands:
continue
# systemd_run_commands contains the shell that started systemd-run, so we also need to check for the parent
if self._get_parent(process) in systemd_run_commands and self._get_command(process) == 'systemd-run':
continue
# check if the process is a command started by the agent or a descendant of one of those commands
current = process
while current != 0 and current not in agent_commands:
current = self._get_parent(current)
if current == 0:
unexpected.append(process)
unexpected.append(self.__format_process(process))
if len(unexpected) >= 5: # collect just a small sample
break
except Exception as exception:
_log_cgroup_warning("Error checking the processes in the agent's cgroup: {0}".format(ustr(exception)))

if len(unexpected) > 0:
raise CGroupsException("The agent's cgroup includes unexpected processes: {0}".format(self.__format_processes(unexpected)))
raise CGroupsException("The agent's cgroup includes unexpected processes: {0}".format(unexpected))

@staticmethod
def _get_command(pid):
try:
with open('/proc/{0}/comm'.format(pid), "r") as file_:
comm = file_.read()
if comm and comm[-1] == '\x00': # if null-terminated, remove the null
comm = comm[:-1]
return comm.rstrip()
except Exception:
return "UNKNOWN"

@staticmethod
def __format_processes(pid_list):
def __format_process(pid):
"""
Formats the given PIDs as a sequence of strings containing the PIDs and their corresponding command line (truncated to 40 chars)
Formats the given PID as a string containing the PID and the corresponding command line truncated to 64 chars
"""
def get_command_line(pid):
try:
cmdline = '/proc/{0}/cmdline'.format(pid)
if os.path.exists(cmdline):
with open(cmdline, "r") as cmdline_file:
return "[PID: {0}] {1:64.64}".format(pid, cmdline_file.read())
except Exception:
pass
return "[PID: {0}] UNKNOWN".format(pid)

return [get_command_line(pid) for pid in pid_list]
try:
cmdline = '/proc/{0}/cmdline'.format(pid)
if os.path.exists(cmdline):
with open(cmdline, "r") as cmdline_file:
return "[PID: {0}] {1:64.64}".format(pid, cmdline_file.read())
except Exception:
pass
return "[PID: {0}] UNKNOWN".format(pid)

@staticmethod
def _check_agent_throttled_time(cgroup_metrics):
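The loop in `_check_processes_in_agent_cgroup` walks each process's parent chain until it reaches either a command started by the agent or PID 0. The sketch below isolates that walk. It assumes the parent relationships are supplied as an acyclic dictionary rather than read from `/proc`, and `find_unexpected` and `format_process` are hypothetical names used only for illustration.

```python
def find_unexpected(processes, parent_of, expected_roots, sample_size=5):
    """Return up to sample_size PIDs that do not descend from any expected root."""
    unexpected = []
    for pid in processes:
        current = pid
        # Climb towards PID 0; stop early if we reach an expected ancestor.
        while current != 0 and current not in expected_roots:
            current = parent_of.get(current, 0)  # unknown parent is treated as 0
        if current == 0:
            unexpected.append(pid)
            if len(unexpected) >= sample_size:  # collect just a small sample
                break
    return unexpected


def format_process(pid, cmdline):
    """Same format spec as the diff: pad and truncate the command line to 64 chars."""
    return "[PID: {0}] {1:64.64}".format(pid, cmdline)
```

Here `expected_roots` plays the role of `agent_commands`: a process is reported only when the climb ends at 0 without passing through one of them.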
@@ -555,11 +598,12 @@ def _get_parent(pid):
pass
return 0

def start_extension_command(self, extension_name, command, timeout, shell, cwd, env, stdout, stderr, error_code=ExtensionErrorCodes.PluginUnknownFailure):
def start_extension_command(self, extension_name, command, cmd_name, timeout, shell, cwd, env, stdout, stderr, error_code=ExtensionErrorCodes.PluginUnknownFailure):
"""
Starts a command (install/enable/etc) for an extension and adds the command's PID to the extension's cgroup
:param extension_name: The extension executing the command
:param command: The command to invoke
:param cmd_name: The type of the command (enable, install, etc.)
:param timeout: Number of seconds to wait for command completion
:param cwd: The working directory for the command
:param env: The environment to pass to the command's process
@@ -570,7 +614,7 @@ def start_extension_command(self, extension_name, command, timeout, shell, cwd,
"""
if self.enabled():
try:
return self._cgroups_api.start_extension_command(extension_name, command, timeout, shell=shell, cwd=cwd, env=env, stdout=stdout, stderr=stderr, error_code=error_code)
return self._cgroups_api.start_extension_command(extension_name, command, cmd_name, timeout, shell=shell, cwd=cwd, env=env, stdout=stdout, stderr=stderr, error_code=error_code)
except SystemdRunError as exception:
reason = 'Failed to start {0} using systemd-run, will try invoking the extension directly. Error: {1}'.format(extension_name, ustr(exception))
self.disable(reason)
@@ -580,6 +624,45 @@ def start_extension_command(self, extension_name, command, timeout, shell, cwd,
process = subprocess.Popen(command, shell=shell, cwd=cwd, env=env, stdout=stdout, stderr=stderr, preexec_fn=os.setsid) # pylint: disable=W1509
return handle_process_completion(process=process, command=command, timeout=timeout, stdout=stdout, stderr=stderr, error_code=error_code)

def setup_extension_slice(self, extension_name):
"""
Each extension runs under its own slice (e.g. "Microsoft.CPlat.Extension.slice"). All the slices for
extensions are grouped under "azure-vmextensions.slice".

This method ensures that the extension slice is created: the unit file is written
under /lib/systemd/system if it does not already exist.
"""
if self.enabled():
unit_file_install_path = systemd.get_unit_file_install_path()
extension_slice_path = os.path.join(unit_file_install_path,
SystemdCgroupsApi.get_extension_cgroup_name(extension_name) + ".slice")
if not os.path.exists(extension_slice_path):
try:
slice_contents = _EXTENSION_SLICE_CONTENTS.format(extension_name = extension_name)
CGroupConfigurator._Impl.__create_unit_file(extension_slice_path, slice_contents)
except Exception as exception:
_log_cgroup_warning("Failed to create unit files for the extension slice: {0}", ustr(exception))
CGroupConfigurator._Impl.__cleanup_unit_file(extension_slice_path)

def remove_extension_slice(self, extension_name):
"""
This method ensures that the extension slice is removed from /lib/systemd/system if it exists.
Lastly, it stops the unit, which cleans up the corresponding /sys/fs/cgroup controller paths.
"""
if self.enabled():
unit_file_install_path = systemd.get_unit_file_install_path()
extension_slice_name = SystemdCgroupsApi.get_extension_cgroup_name(extension_name) + ".slice"
extension_slice_path = os.path.join(unit_file_install_path, extension_slice_name)
if os.path.exists(extension_slice_path):
CGroupConfigurator._Impl.__cleanup_unit_file(extension_slice_path)
# stop the unit gracefully; the extension's slice will be removed from the /sys/fs/cgroup path
try:
logger.info("Executing systemctl stop {0}".format(extension_slice_name))
shellutil.run_command(["systemctl", "stop", extension_slice_name])
except Exception as exception:
_log_cgroup_warning("systemctl stop failed (remove slice): {0}", ustr(exception))


# unique instance for the singleton
_instance = None
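Both `__setup_azure_slice` and `setup_extension_slice` follow the same pattern: render a slice unit from a template, then write it only if the file is not already present. A minimal sketch of that pattern, writing to a temporary directory instead of the systemd unit path. `ensure_unit_file` and `demo` are hypothetical helpers, not the agent's `__create_unit_file`.

```python
import os
import tempfile

# Template contents mirror _LOGCOLLECTOR_SLICE_CONTENTS_FMT from the diff.
SLICE_TEMPLATE = """[Unit]
Description=Slice for Azure VM Agent Periodic Log Collector
DefaultDependencies=no
Before=slices.target
[Slice]
CPUAccounting=yes
CPUQuota={cpu_quota}
MemoryAccounting=yes
MemoryLimit={memory_limit}
"""


def ensure_unit_file(path, contents):
    """Write the unit file only if it is missing; return True if it was created."""
    if os.path.exists(path):
        return False
    with open(path, "w") as unit_file:
        unit_file.write(contents)
    return True


def demo():
    """Create the slice file twice in a scratch directory; only the first call writes."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "azure-walinuxagent-logcollector.slice")
    contents = SLICE_TEMPLATE.format(cpu_quota="5%", memory_limit="30M")
    first = ensure_unit_file(path, contents)
    second = ensure_unit_file(path, contents)
    return first, second, contents
```

On a real system, systemd also has to reload its configuration (for example with `systemctl daemon-reload`) before it sees a new unit file. Whether and where the agent does this is not visible in this diff.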
