[system-health] Add support for monitor system health #19

Closed
22 commits
f3d3fb5
system health first commit
Junchao-Mellanox Jun 4, 2020
63623a7
system health daemon first commit
Junchao-Mellanox Jun 4, 2020
e988130
Finish healthd
Junchao-Mellanox Jun 5, 2020
7ed33df
Changes due to lower layer logic change
Junchao-Mellanox Jun 8, 2020
fd301e6
Get ASIC temperature from TEMPERATURE_INFO table
Junchao-Mellanox Jun 9, 2020
77d57cc
Add system health make rule and service files
Junchao-Mellanox Jun 9, 2020
ae00266
fix bugs found during manual test
Junchao-Mellanox Jun 9, 2020
ad8a740
Change make file to install system-health library to host
Junchao-Mellanox Jun 10, 2020
cf861fe
Set system LED to blink on bootup time
Junchao-Mellanox Jun 11, 2020
7eb6082
Caught exceptions in system health checker to make it more robust
Junchao-Mellanox Jun 11, 2020
91c43f0
fix issue that fan/psu presence will always be true
Junchao-Mellanox Jun 11, 2020
509fa5c
fix issue for external checker
Junchao-Mellanox Jun 11, 2020
d88515d
move system-health service to right after rc-local service
Junchao-Mellanox Jun 11, 2020
a198cc5
Set system-health service start after database service
Junchao-Mellanox Jun 15, 2020
30b4668
Get system up time via /proc/uptime
Junchao-Mellanox Jun 16, 2020
8fea891
Provide more information in stat for CLI to use
Junchao-Mellanox Jun 16, 2020
0134052
fix typo
Junchao-Mellanox Jun 16, 2020
f1def48
Set default category to External for external checker
Junchao-Mellanox Jun 17, 2020
7123b8e
If external checker reported OK, save it to stat too
Junchao-Mellanox Jun 17, 2020
d68a43c
Trim string for external checker output
Junchao-Mellanox Jun 17, 2020
b24c6f8
fix issue: PSU voltage check always return OK
Junchao-Mellanox Jun 18, 2020
d9d125d
Add unit test cases for system health library
Junchao-Mellanox Jun 23, 2020
10 changes: 10 additions & 0 deletions files/build_templates/sonic_debian_extension.j2
@@ -159,6 +159,12 @@ sudo cp {{daemon_base_py2_wheel_path}} $FILESYSTEM_ROOT/$DAEMON_BASE_PY2_WHEEL_N
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip install $DAEMON_BASE_PY2_WHEEL_NAME
sudo rm -rf $FILESYSTEM_ROOT/$DAEMON_BASE_PY2_WHEEL_NAME

# Install system-health Python 2 package
SYSTEM_HEALTH_PY2_WHEEL_NAME=$(basename {{system_health_py2_wheel_path}})
sudo cp {{system_health_py2_wheel_path}} $FILESYSTEM_ROOT/$SYSTEM_HEALTH_PY2_WHEEL_NAME
sudo https_proxy=$https_proxy LANG=C chroot $FILESYSTEM_ROOT pip install $SYSTEM_HEALTH_PY2_WHEEL_NAME
sudo rm -rf $FILESYSTEM_ROOT/$SYSTEM_HEALTH_PY2_WHEEL_NAME

# Install built Python Click package (and its dependencies via 'apt-get -y install -f')
# Do this before installing sonic-utilities so that it doesn't attempt to install
# an older version as part of its dependencies
@@ -243,6 +249,10 @@ sudo mkdir -p $FILESYSTEM_ROOT/etc/systemd/system/syslog.socket.d
sudo cp $IMAGE_CONFIGS/syslog/override.conf $FILESYSTEM_ROOT/etc/systemd/system/syslog.socket.d/override.conf
sudo cp $IMAGE_CONFIGS/syslog/host_umount.sh $FILESYSTEM_ROOT/usr/bin/

# Copy system-health files
sudo LANG=C cp $IMAGE_CONFIGS/system-health/system-health.service $FILESYSTEM_ROOT_USR_LIB_SYSTEMD_SYSTEM
echo "system-health.service" | sudo tee -a $GENERATED_SERVICE_FILE

# Copy logrotate.d configuration files
sudo cp -f $IMAGE_CONFIGS/logrotate/logrotate.d/* $FILESYSTEM_ROOT/etc/logrotate.d/

11 changes: 11 additions & 0 deletions files/image_config/system-health/system-health.service
@@ -0,0 +1,11 @@
[Unit]
Description=Monitor system health
Requires=database.service updategraph.service
After=database.service updategraph.service

[Service]
ExecStart=/usr/local/bin/healthd
Restart=always

[Install]
WantedBy=multi-user.target
9 changes: 9 additions & 0 deletions rules/system-health.mk
@@ -0,0 +1,9 @@
# system health python2 wheel

SYSTEM_HEALTH = system_health-1.0-py2-none-any.whl
$(SYSTEM_HEALTH)_SRC_PATH = $(SRC_PATH)/system-health
$(SYSTEM_HEALTH)_PYTHON_VERSION = 2
$(SYSTEM_HEALTH)_DEPENDS = $(SONIC_DAEMON_BASE_PY2) $(SWSSSDK_PY2) $(SONIC_CONFIG_ENGINE)
SONIC_PYTHON_WHEELS += $(SYSTEM_HEALTH)

export system_health_py2_wheel_path="$(addprefix $(PYTHON_WHEELS_PATH)/,$(SYSTEM_HEALTH))"
3 changes: 2 additions & 1 deletion slave.mk
@@ -786,7 +786,8 @@ $(addprefix $(TARGET_PATH)/, $(SONIC_INSTALLERS)) : $(TARGET_PATH)/% : \
$(addprefix $(PYTHON_WHEELS_PATH)/,$(REDIS_DUMP_LOAD_PY2)) \
$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_PLATFORM_API_PY2)) \
$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_YANG_MODELS_PY3)) \
-$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_YANG_MGMT_PY))
+$(addprefix $(PYTHON_WHEELS_PATH)/,$(SONIC_YANG_MGMT_PY)) \
+$(addprefix $(PYTHON_WHEELS_PATH)/,$(SYSTEM_HEALTH))
$(HEADER)
# Pass initramfs and linux kernel explicitly. They are used for all platforms
export debs_path="$(IMAGE_DISTRO_DEBS_PATH)"
8 changes: 8 additions & 0 deletions src/system-health/.gitignore
@@ -0,0 +1,8 @@
*/deb_dist/
*/dist/
*/build/
*/*.tar.gz
*/*.egg-info
*/.cache/
*.pyc
*/__pycache__/
2 changes: 2 additions & 0 deletions src/system-health/health_checker/__init__.py
@@ -0,0 +1,2 @@
from . import hardware_checker
from . import service_checker
88 changes: 88 additions & 0 deletions src/system-health/health_checker/config.py
@@ -0,0 +1,88 @@
import os
import json
from sonic_device_util import get_machine_info
from sonic_device_util import get_platform_info


class Config(object):
    """Per-platform configuration for the system health monitor."""
    DEFAULT_INTERVAL = 60
    DEFAULT_BOOTUP_TIMEOUT = 300
    DEFAULT_LED_CONFIG = {
        'fault': 'red',
        'normal': 'green',
        'booting': 'orange_blink'
    }
    GET_PLATFORM_CMD = 'sonic-cfggen -d -v DEVICE_METADATA.localhost.platform'
    CONFIG_FILE = 'system_health_monitoring_config.json'

    def __init__(self):
        mi = get_machine_info()
        if mi is not None:
            self.platform_name = get_platform_info(mi)
        else:
            self.platform_name = self._get_platform_name()
        self._config_file = os.path.join('/usr/share/sonic/device/', self.platform_name, Config.CONFIG_FILE)
        self._last_mtime = None
        self.config_data = None
        self.interval = Config.DEFAULT_INTERVAL
        self.ignore_services = None
        self.ignore_devices = None
        self.external_checkers = None

    def load_config(self):
        # Fall back to defaults if the config file has been removed.
        if not os.path.exists(self._config_file):
            if self._last_mtime is not None:
                self._reset()
            return

        # Compare the modification time only; the full stat result also
        # carries the access time, which can change on every read.
        mtime = os.stat(self._config_file).st_mtime
        if mtime != self._last_mtime:
            try:
                self._last_mtime = mtime
                with open(self._config_file, 'r') as f:
                    self.config_data = json.load(f)

                self.interval = self.config_data.get('polling_interval', Config.DEFAULT_INTERVAL)
                self.ignore_services = self._get_list_data('services_to_ignore')
                self.ignore_devices = self._get_list_data('devices_to_ignore')
                self.external_checkers = self._get_list_data('external_checkers')
            except Exception:
                # An unreadable or malformed file reverts everything to defaults.
                self._reset()

    def _reset(self):
        self._last_mtime = None
        self.config_data = None
        self.interval = Config.DEFAULT_INTERVAL
        self.ignore_services = None
        self.ignore_devices = None
        self.external_checkers = None

    def get_led_color(self, status):
        if self.config_data and 'led_color' in self.config_data:
            if status in self.config_data['led_color']:
                return self.config_data['led_color'][status]

        return self.DEFAULT_LED_CONFIG[status]

    def get_bootup_timeout(self):
        if self.config_data and 'boot_timeout' in self.config_data:
            try:
                timeout = int(self.config_data['boot_timeout'])
                if timeout <= 0:
                    timeout = self.DEFAULT_BOOTUP_TIMEOUT
                return timeout
            except (TypeError, ValueError):
                # Non-numeric values (including null or a list) use the default.
                pass
        return self.DEFAULT_BOOTUP_TIMEOUT

    def _get_platform_name(self):
        from .utils import run_command
        output = run_command(Config.GET_PLATFORM_CMD)
        return output.strip()

    def _get_list_data(self, key):
        # Only JSON lists are accepted; they are returned as sets.
        if key in self.config_data:
            data = self.config_data[key]
            if isinstance(data, list):
                return set(data)
        return None
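For reviewers, the keys that `Config` reads can be exercised outside a SONiC image. The sketch below is illustrative only: the key names (`polling_interval`, `services_to_ignore`, `devices_to_ignore`, `external_checkers`, `led_color`, `boot_timeout`) come from `config.py`, but every value in `SAMPLE_CONFIG` and the helper `summarize` are hypothetical and not part of this PR.

```python
import json
import os
import tempfile

# Hypothetical values; only the key names are taken from config.py.
SAMPLE_CONFIG = {
    "polling_interval": 30,
    "services_to_ignore": ["example-service"],
    "devices_to_ignore": ["example-device"],
    "external_checkers": ["/usr/bin/example_checker"],
    "led_color": {"fault": "amber"},
    "boot_timeout": "120",
}

DEFAULT_LED = {"fault": "red", "normal": "green", "booting": "orange_blink"}


def summarize(path, default_interval=60, default_timeout=300):
    """Mirror the lookups made by Config, without the SONiC-only imports."""
    with open(path) as f:
        data = json.load(f)

    def as_set(key):
        # Same rule as Config._get_list_data: only JSON lists are accepted.
        value = data.get(key)
        return set(value) if isinstance(value, list) else None

    try:
        timeout = int(data.get("boot_timeout", default_timeout))
    except (TypeError, ValueError):
        timeout = default_timeout
    if timeout <= 0:
        timeout = default_timeout

    # Per-status override with fallback to the built-in LED colors.
    led = dict(DEFAULT_LED)
    led.update(data.get("led_color", {}))

    return {
        "interval": data.get("polling_interval", default_interval),
        "ignore_services": as_set("services_to_ignore"),
        "ignore_devices": as_set("devices_to_ignore"),
        "external_checkers": as_set("external_checkers"),
        "boot_timeout": timeout,
        "led_color": led,
    }


def write_sample(config=SAMPLE_CONFIG):
    """Write the sample to a temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".json")
    with os.fdopen(fd, "w") as f:
        json.dump(config, f)
    return path
```

Calling `summarize(write_sample())` returns the effective settings. Statuses missing from `led_color` keep the defaults from `DEFAULT_LED_CONFIG`.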
63 changes: 63 additions & 0 deletions src/system-health/health_checker/external_checker.py
@@ -0,0 +1,63 @@
from .health_checker import HealthChecker
from . import utils


class ExternalChecker(HealthChecker):
    """Runs a user-supplied command and parses its output as health status."""

    def __init__(self, cmd):
        HealthChecker.__init__(self)
        self._cmd = cmd
        self._category = None

    def reset(self):
        self._category = 'External'
        self._info = {}

    def get_category(self):
        return self._category

    def check(self, config):
        self.reset()

        output = utils.run_command(self._cmd)
        if not output:
            self.set_object_not_ok('External', str(self), 'Failed to get output of command \"{}\"'.format(self._cmd))
            return

        output = output.strip()
        if not output:
            self.set_object_not_ok('External', str(self), 'Failed to get output of command \"{}\"'.format(self._cmd))
            return

        raw_lines = output.splitlines()
        if not raw_lines:
            self.set_object_not_ok('External', str(self), 'Invalid output of command \"{}\"'.format(self._cmd))
            return

        # Drop blank lines and surrounding whitespace.
        lines = []
        for line in raw_lines:
            line = line.strip()
            if not line:
                continue

            lines.append(line)

        if not lines:
            self.set_object_not_ok('External', str(self), 'Invalid output of command \"{}\"'.format(self._cmd))
            return

        # First line is the category; each following line is "<object>:<message>".
        self._category = lines[0]
        if len(lines) > 1:
            for line in lines[1:]:
                pos = line.find(':')
                if pos == -1:
                    continue
                obj_name = line[:pos].strip()
                msg = line[pos+1:].strip()
                if msg != 'OK':
                    self.set_object_not_ok('External', obj_name, msg)
                else:
                    self.set_object_ok('External', obj_name)
        return

    def __str__(self):
        return 'ExternalChecker - {}'.format(self._cmd)
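The output format that `ExternalChecker.check` expects can be read off the parsing loop: the first non-blank line names the category, and each later line has the form `<object>:<message>`, where the message `OK` means healthy. The standalone sketch below reproduces only that parsing step. The function name `parse_checker_output` and the sample text are hypothetical, not part of this PR.

```python
def parse_checker_output(output):
    """Return (category, {object_name: message}) for external checker output.

    Follows the rules in ExternalChecker.check: blank lines are dropped, the
    first remaining line is the category, and lines without ':' are skipped.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return None, {}

    category = lines[0]
    results = {}
    for line in lines[1:]:
        # Split on the first ':' only, so messages may contain colons.
        name, sep, msg = line.partition(':')
        if not sep:
            continue
        results[name.strip()] = msg.strip()
    return category, results


# Hypothetical checker output used for illustration.
SAMPLE_OUTPUT = """
MyCategory
device1:OK
device2: device2 is out of power
this line has no separator
"""

category, results = parse_checker_output(SAMPLE_OUTPUT)
unhealthy = {name: msg for name, msg in results.items() if msg != 'OK'}
```

With the sample above, `unhealthy` contains only `device2`. In the PR itself, objects are recorded under the `'External'` type, while the parsed first line is what `get_category` returns.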