Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System Ready #10479

Merged
merged 6 commits into from
May 20, 2022
Merged

System Ready #10479

merged 6 commits into from
May 20, 2022

Conversation

sg893052
Copy link
Contributor

@sg893052 sg893052 commented Apr 6, 2022

Why I did it

At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness.

How I did it

A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any.

How to verify it

"show system-health sysready-status" click CLI
Syslogs for system ready

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@lgtm-com
Copy link

lgtm-com bot commented Apr 6, 2022

This pull request introduces 15 alerts when merging 590ee50a9547bf9d9fd88b478aeefdca37e0f571 into b152f2a - view on LGTM.com

new alerts:

  • 9 for Unused import
  • 2 for Variable defined multiple times
  • 1 for Wrong name for an argument in a call
  • 1 for Testing equality to None
  • 1 for Unused local variable
  • 1 for Unreachable code

@Junchao-Mellanox
Copy link
Collaborator

Could you please add comments for each new function/class?

@Junchao-Mellanox
Copy link
Collaborator

Could you please fix the LGTM issues?

@Junchao-Mellanox
Copy link
Collaborator

Could please re-format files? I see many format issues, for example:

SYSLOG_IDENTIFIER="system#monitor"

should be

SYSLOG_IDENTIFIER = "system#monitor"

@Junchao-Mellanox
Copy link
Collaborator

Junchao-Mellanox commented Apr 8, 2022

Could you please remove unused comments?

@@ -56,6 +56,9 @@
"has_global_scope": {% if feature + '.service' in installer_services.split(' ') %}true{% else %}false{% endif %},
"has_per_asic_scope": {% if feature + '@.service' in installer_services.split(' ') %}true{% else %}false{% endif %},
"auto_restart": "{{autorestart}}",
{%- if feature in ["bgp", "swss", "pmon"] %}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add comment to explain the purpose here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

import multiprocessing
from collections import OrderedDict
import tempfile
from swsssdk import ConfigDBConnector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use ConfigDBConnector in swsscommon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

mpmgr = multiprocessing.Manager()

#Logger class for syslog
class Logger(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use sonic_py_common.logger.Logger?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

#Initalise the syslog infrastructure
logger = Logger(SYSLOG_IDENTIFIER)

class Dict2Obj(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a special reason to use this class instead of directly using dict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed


def unlock(self):
fcntl.flock(self.f, fcntl.LOCK_UN)
#self.f.close()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will the file being closed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed lock as it is not need here.

if self.task_stopping_event.is_set():
return
self.task_queue.put(msg)
time.sleep(0.001)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this sleep?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

time.sleep(0.001)

def task_stop(self):
self.task_stopping_event.set()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you need stop the event loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the event loop runs, as ProcessTaskBase in src/sonic-py-common/sonic_py_common/task_base.py has task_stop which actually kills the subprocess if not joined(exited) on time.

def task_stop(self):
    # Signal the process to stop
    self.task_stopping_event.set()

    # Wait for the process to exit
    self._task_process.join(self._stop_timeout_secs)

    # If the process didn't exit, attempt to kill it
    if self._task_process.is_alive():
        os.kill(self._task_process.pid, signal.SIGKILL)

    if self._task_process.is_alive():
        return False


update_proc_name('sysmonitor')
event = 'SERVICE_EVENT'
sysready_lock.lock()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the lock for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed lock as it is not needed here.

# Queue to receive the STATEDB and Systemd state change event
while True:
try:
msg = myQ.get(timeout=15)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use constant value instead of magic number

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -72,6 +73,8 @@ class HealthDaemon(DaemonBase):
import sonic_platform.platform
chassis = sonic_platform.platform.Platform().get_chassis()
manager = HealthCheckerManager()
sysmon = Sysmonitor()
sysmon.system_service()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sysmon.system_service is a infinite loop, the rest of the code in this function will never run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken care.

@Junchao-Mellanox
Copy link
Collaborator

could you please add unit test?

@lgtm-com
Copy link

lgtm-com bot commented Apr 24, 2022

This pull request introduces 4 alerts when merging 9388306a2644f61c903aa5b918f7ba5564af4af0 into 5cd6bc4 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Use of the return value of a procedure
  • 1 for Unused local variable

SYSTEM_READY_TABLE = 'SYSTEM_READY'
SYSTEM_ALLSRV_STATE="DOWN"
SYSREADY_LOCKFILE="/var/run/sysready.lock"
dnsrvs_name_list=[]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate the reason to keep it as a list? Query/remove an element from set is faster than list.


#add the services from the below targets
path= ["/etc/systemd/system/multi-user.target.wants", "/etc/systemd/system/sonic.target.wants"]
for p in path:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand.

global dnsrvs_name_list
global allsrvs_dict
global allsrvs_status
global spl_srv_list
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spl_srv_list not used.

def run_systemctl_show(service):
command = ('systemctl show {} --property=Id,LoadState,UnitFileState,Type,ActiveState,SubState,Result'.format(service))
output = utils.run_command(command)
srv_properties = output.split('\n')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use output.splitlines()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, used glob in place of listdir and modified dnsrvs_name from list to set as per suggestion.


def post_unit_status(st_db, srv_name, srv_status, app_status, fail_reason, update_time):
key = 'ALL_SERVICE_STATUS|{}'.format(srv_name)
st_db.set(st_db.STATE_DB, key, 'service_status', srv_status)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you consider using hmset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@lgtm-com
Copy link

lgtm-com bot commented Apr 25, 2022

This pull request introduces 1 alert when merging ee0818c6f8ddc19ae926e7a128016717bd515f84 into 3fc3259 - view on LGTM.com

new alerts:

  • 1 for Unused import

@sg893052 sg893052 force-pushed the SYSTEM_READY branch 3 times, most recently from e6f68dd to 001b7ed Compare April 29, 2022 01:55
@lgtm-com
Copy link

lgtm-com bot commented Apr 29, 2022

This pull request introduces 1 alert when merging 001b7ed92156e6ff9099f8a254a0c711ddee122c into e0f5333 - view on LGTM.com

new alerts:

  • 1 for Unused import

@sg893052 sg893052 force-pushed the SYSTEM_READY branch 2 times, most recently from 95e5686 to 9782dbe Compare May 5, 2022 12:59
@sg893052
Copy link
Contributor Author

sg893052 commented May 5, 2022

Could please re-format files? I see many format issues, for example:

SYSLOG_IDENTIFIER="system#monitor"

should be

SYSLOG_IDENTIFIER = "system#monitor"

Taken care of all the format issues

@sg893052
Copy link
Contributor Author

sg893052 commented May 5, 2022

Could you please add comments for each new function/class?

Added comments for all functions and classes

@sg893052
Copy link
Contributor Author

sg893052 commented May 5, 2022

could you please add unit test?

Have added unit test coverage for the system ready code in test_system_health.py
Also, attaching the manual UT performed runtime on a target.

Manual UT runtime - system ready.txt

@adyeung
Copy link
Collaborator

adyeung commented May 6, 2022

/azpw run Azure.sonic-buildimage

1 similar comment
@sg893052
Copy link
Contributor Author

sg893052 commented May 7, 2022

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Bring in all global functions under Sysmonitor class
Add unit test code coverage for sysmonitor code
Formatted the code
@zhangyanzhao
Copy link
Collaborator

/azpw run Azure.sonic-buildimage

1 similar comment
@sg893052
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@sg893052
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@sg893052
Copy link
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@sg893052
Copy link
Contributor Author

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 10479 in repo Azure/sonic-buildimage

@sg893052
Copy link
Contributor Author

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link

Commenter does not have sufficient privileges for PR 10479 in repo Azure/sonic-buildimage

@liushilongbuaa
Copy link
Contributor

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@zhangyanzhao
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@sujinmkang sujinmkang merged commit f37dd77 into sonic-net:master May 20, 2022
description "This configuration controls the system ready tool to check
the app ready/up status";
type boolean;
default false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add yang model unit test?
We recently fixed a bug #10986

liushilongbuaa pushed a commit to liushilongbuaa/sonic-buildimage that referenced this pull request Jun 20, 2022
…anch

Related work items: #52, #71, #73, #75, #77, sonic-net#1306, sonic-net#1588, sonic-net#1991, sonic-net#2031, sonic-net#2040, sonic-net#2053, sonic-net#2066, sonic-net#2069, sonic-net#2087, sonic-net#2107, sonic-net#2110, sonic-net#2112, sonic-net#2113, sonic-net#2117, sonic-net#2124, sonic-net#2125, sonic-net#2126, sonic-net#2128, sonic-net#2130, sonic-net#2131, sonic-net#2132, sonic-net#2133, sonic-net#2134, sonic-net#2135, sonic-net#2136, sonic-net#2137, sonic-net#2138, sonic-net#2139, sonic-net#2140, sonic-net#2143, sonic-net#2158, sonic-net#2161, sonic-net#2233, sonic-net#2243, sonic-net#2250, sonic-net#2254, sonic-net#2260, sonic-net#2261, sonic-net#2267, sonic-net#2278, sonic-net#2282, sonic-net#2285, sonic-net#2288, sonic-net#2289, sonic-net#2292, sonic-net#2294, sonic-net#8887, sonic-net#9279, sonic-net#9390, sonic-net#9511, sonic-net#9700, sonic-net#10025, sonic-net#10322, sonic-net#10479, sonic-net#10484, sonic-net#10493, sonic-net#10500, sonic-net#10580, sonic-net#10595, sonic-net#10628, sonic-net#10634, sonic-net#10635, sonic-net#10644, sonic-net#10670, sonic-net#10691, sonic-net#10716, sonic-net#10731, sonic-net#10750, sonic-net#10751, sonic-net#10752, sonic-net#10761, sonic-net#10769, sonic-net#10775, sonic-net#10776, sonic-net#10779, sonic-net#10786, sonic-net#10792, sonic-net#10793, sonic-net#10800, sonic-net#10806, sonic-net#10826, sonic-net#10839, sonic-net#10840, sonic-net#10842, sonic-net#10844, sonic-net#10847, sonic-net#10849, sonic-net#10852, sonic-net#10865, sonic-net#10872, sonic-net#10877, sonic-net#10886, sonic-net#10889, sonic-net#10903, sonic-net#10904, sonic-net#10905, sonic-net#10913, sonic-net#10914, sonic-net#10916, sonic-net#10919, sonic-net#10925, sonic-net#10926, sonic-net#10929, sonic-net#10933, sonic-net#10934, sonic-net#10937, sonic-net#10941, sonic-net#10947, sonic-net#10952, sonic-net#10953, sonic-net#10957, sonic-net#10959, sonic-net#10971, sonic-net#10972, sonic-net#10980
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants