[Sys Mon] Fix the service entry delete in state_db because of timer job #14702

vivekrnv · 2023-04-18T15:32:17Z

Why I did it

systemd stop event on service with timers can sometime delete the state_db entry for the corresponding service.

Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers

root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp
root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status 
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
<Truncated>
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -
<Truncated>

Expected

Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB

root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status 
System is not ready - one or more services are not up

Service-Name            Service-Status    App-Ready-Status    Down-Reason
----------------------  ----------------  ------------------  -------------
<Truncated>
snmp                    Down              Down                Inactive
ssh                     OK                OK                  -
swss                    OK                OK                  -
syncd                   OK                OK                  -
sysstat                 OK                OK                  -
teamd                   OK                OK                  -
telemetry               OK                OK                  -
what-just-happened      OK                OK                  -
ztp                     OK                OK                  -
<Truncated>

How I did it

Happens because the timer is usually a PartOf service and thus a stop on service is propagated to timer. Fixed the logic to handle this

Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ] 
Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead]

Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47
Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ]

How to verify it

Which release branch to backport (provide reason below if selected)

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

liat-grozovik · 2023-04-23T12:20:39Z

@lguohan could you pls suggest reviewer?

liat-grozovik · 2023-04-23T12:22:16Z

@sg893052 could you please help to review the issue?

liat-grozovik · 2023-04-27T14:48:04Z

@lguohan @qiluo-msft could you please help to merge?

…ob (sonic-net#14702) Why I did it systemd stop event on service with timers can sometime delete the state_db entry for the corresponding service. Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> Expected Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> snmp Down Down Inactive ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> How I did it Happens because the timer is usually a PartOf service and thus a stop on service is propagated to timer. Fixed the logic to handle this Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ] Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead] Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ] Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

mssonicbld · 2023-05-17T20:36:03Z

Cherry-pick PR to 202205: #15121

…ob (sonic-net#14702) Why I did it systemd stop event on service with timers can sometime delete the state_db entry for the corresponding service. Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> Expected Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> snmp Down Down Inactive ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> How I did it Happens because the timer is usually a PartOf service and thus a stop on service is propagated to timer. Fixed the logic to handle this Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ] Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead] Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ] Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

mssonicbld · 2023-05-17T20:36:32Z

Cherry-pick PR to 202211: #15122

…ob (#14702) Why I did it systemd stop event on service with timers can sometime delete the state_db entry for the corresponding service. Note: This won't be observed on the latest master label since the dependency on timer was removed with the recent config reload enhancement. However, it is better to have the fix since there might be some systemd services added to system health daemon in the future which may contain timers root@qa-eth-vt01-4-3700c:/home/admin# systemctl stop snmp root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> Expected Should see a Down entry for SNMP instead of the entry being deleted from the STATE_DB root@qa-eth-vt01-4-3700c:/home/admin# show system-health sysready-status System is not ready - one or more services are not up Service-Name Service-Status App-Ready-Status Down-Reason ---------------------- ---------------- ------------------ ------------- <Truncated> snmp Down Down Inactive ssh OK OK - swss OK OK - syncd OK OK - sysstat OK OK - teamd OK OK - telemetry OK OK - what-just-happened OK OK - ztp OK OK - <Truncated> How I did it Happens because the timer is usually a PartOf service and thus a stop on service is propagated to timer. Fixed the logic to handle this Apr 18 02:06:47.711252 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.service from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.711347 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.service ] Apr 18 02:06:47.722363 r-lionfish-16 INFO healthd: snmp.service service state changed to [inactive/dead] Apr 18 02:06:47.723230 r-lionfish-16 DEBUG healthd: Main process- received event:snmp.timer from source:sysbus time:2023-04-17 23:06:47 Apr 18 02:06:47.723328 r-lionfish-16 INFO healthd: check_unit_status for [ snmp.timer ] Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

[Sys Mon] Fix the service entry delete in state_db because of timer job

8c79c04

Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>

vivekrnv requested a review from lguohan as a code owner April 18, 2023 15:32

vivekrnv added the Request for 202211 Branch label Apr 18, 2023

vivekrnv requested a review from Junchao-Mellanox April 18, 2023 15:33

Junchao-Mellanox approved these changes Apr 19, 2023

View reviewed changes

sg893052 approved these changes Apr 24, 2023

View reviewed changes

vivekrnv added the Request for 202205 Branch label Apr 25, 2023

lguohan merged commit 22b4aac into sonic-net:master Apr 27, 2023

yxieca added the Approved for 202205 Branch label May 17, 2023

mssonicbld added the Created PR to 202205 Branch label May 17, 2023

yxieca added Approved for 202211 Branch and removed Created PR to 202205 Branch labels May 17, 2023

mssonicbld mentioned this pull request May 17, 2023

[action] [PR:14702] [Sys Mon] Fix the service entry delete in state_db because of timer job #15121

Merged

10 tasks

mssonicbld added the Created PR to 202211 Branch label May 17, 2023

mssonicbld mentioned this pull request May 17, 2023

[action] [PR:14702] [Sys Mon] Fix the service entry delete in state_db because of timer job #15122

Merged

10 tasks

mssonicbld added Included in 202211 Branch Included in 202205 Branch and removed Approved for 202211 Branch Created PR to 202211 Branch Approved for 202205 Branch labels May 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Sys Mon] Fix the service entry delete in state_db because of timer job #14702

[Sys Mon] Fix the service entry delete in state_db because of timer job #14702

vivekrnv commented Apr 18, 2023 •

edited

Loading

liat-grozovik commented Apr 23, 2023

liat-grozovik commented Apr 23, 2023

liat-grozovik commented Apr 27, 2023

mssonicbld commented May 17, 2023

mssonicbld commented May 17, 2023

[Sys Mon] Fix the service entry delete in state_db because of timer job #14702

[Sys Mon] Fix the service entry delete in state_db because of timer job #14702

Conversation

vivekrnv commented Apr 18, 2023 • edited Loading

Why I did it

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

liat-grozovik commented Apr 23, 2023

liat-grozovik commented Apr 23, 2023

liat-grozovik commented Apr 27, 2023

mssonicbld commented May 17, 2023

mssonicbld commented May 17, 2023

vivekrnv commented Apr 18, 2023 •

edited

Loading