[system-health] No longer check critical process/service status via monit #9068

Merged
merged 9 commits into from
Nov 23, 2021

Conversation

Junchao-Mellanox
Collaborator

@Junchao-Mellanox Junchao-Mellanox commented Oct 26, 2021

HLD updated here: sonic-net/SONiC#887

Why I did it

The command monit summary -B can no longer display the status of each critical process, so system-health should not depend on it and needs another way to monitor critical process status. This PR addresses that. system-health still uses monit for the file system check and for customized checks.

How I did it

  1. Get container names from FEATURE table
  2. For each container, collect the critical process names from the critical_processes file
  3. Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get the status of the processes inside the container, parse the output, and check whether any critical process has exited
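Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names are hypothetical, the critical_processes file is assumed to list entries as `program:<name>` lines, and the `supervisorctl status` output is assumed to start each line with the process name followed by its state. The sketch also drops `-it` from the `docker exec` call, on the assumption that no TTY is available when the check runs from a script.

```python
import subprocess


def parse_critical_processes(text):
    """Collect program names from 'program:<name>' lines (assumed file format)."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("program:"):
            names.add(line[len("program:"):].strip())
    return names


def parse_supervisorctl_status(output):
    """Map process name -> state, taken from the first two columns of each line."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            states[fields[0]] = fields[1]
    return states


def find_not_running(critical, states):
    """Return critical processes that are missing or not in the RUNNING state."""
    return sorted(name for name in critical if states.get(name) != "RUNNING")


def check_container(container, critical_text):
    """Run supervisorctl inside the container and report unhealthy critical processes."""
    # No check=True: supervisorctl may exit non-zero when a process is not running,
    # and that output is exactly what needs to be parsed.
    result = subprocess.run(
        ["docker", "exec", container, "bash", "-c", "supervisorctl status"],
        capture_output=True,
        text=True,
    )
    critical = parse_critical_processes(critical_text)
    return find_not_running(critical, parse_supervisorctl_status(result.stdout))
```

Keeping the parsing separate from the `docker exec` call means the parsing logic can be unit tested with canned output, without a running container.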

How to verify it

  1. Add unit test case to cover it
  2. Adjust sonic-mgmt cases to cover it
  3. Manual test

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

keboliu
keboliu previously approved these changes Oct 26, 2021
@lgtm-com

lgtm-com bot commented Oct 26, 2021

This pull request introduces 1 alert when merging 5a8d671 into b0c73d9 - view on LGTM.com

new alerts:

  • 1 for Unnecessary pass

@Junchao-Mellanox
Collaborator Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft
Collaborator

Add a link of your design doc to the PR description.

@Junchao-Mellanox
Collaborator Author

Hi @qiluo-msft, I have addressed all review comments. Could you please review and sign off?

@Junchao-Mellanox
Collaborator Author

Hi @qiluo-msft, could you please review and sign off?

@Junchao-Mellanox
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Junchao-Mellanox
Collaborator Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@qiluo-msft
Collaborator

Could you rebase to latest master?

@Junchao-Mellanox
Collaborator Author

No clean cherry-pick for 202106/202012, need separate PRs.

@qiluo-msft qiluo-msft merged commit 11a93d2 into sonic-net:master Nov 23, 2021
qiluo-msft pushed a commit that referenced this pull request Nov 24, 2021
…tus via monit (#9367)

Backport #9068 to 202012

#### Why I did it

The command `monit summary -B` can no longer display the status of each critical process, so system-health should not depend on it and needs another way to monitor critical process status. This PR addresses that. system-health still uses monit for the file system check and for customized checks.

#### How I did it

1.	Get container names from FEATURE table
2.	For each container, collect the critical process names from the critical_processes file
3.	Use “docker exec -it <container_name> bash -c ‘supervisorctl status’” to get the status of the processes inside the container, parse the output, and check whether any critical process has exited

#### How to verify it

1. Add unit test case to cover it
2. Adjust sonic-mgmt cases to cover it
3. Manual test
@yozhao101
Contributor

No clean cherry-pick for 202106/202012, need separate PRs.

Can you submit separate PRs against 202012/202106 please?

@Junchao-Mellanox Junchao-Mellanox deleted the system-health-enhance branch December 2, 2021 01:35
@Junchao-Mellanox
Collaborator Author

Hi @yozhao101, I suppose they are already in 202012 and 202106. Please check #9366 and #9367
