Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UT for orchagent watchdog #8306

Merged
merged 6 commits into from
Jun 14, 2023

Conversation

liuh-80
Copy link
Contributor

@liuh-80 liuh-80 commented May 15, 2023

Description of PR

Add UT for orchagent watchdog.

Summary:
SWSS service will add watchdog mechanism to generate keepalive message, and generate alert when swss have issue.
This PR will add new UT to cover the watchdog mechanism.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911
  • 202012
  • 202205

Approach

What is the motivation for this PR?

Add new UT to test and protect watchdog mechanism from code change.

How did you do it?

Pause orchagent service with 'kill -stop' command and check if the watchdog can send alert.

How did you verify/test it?

Manually test new UT.
Pass PR validation.

Any platform specific information?

No

Supported testbed topology if it's a new test case?

Any

Documentation

@mssonicbld
Copy link
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing tests/system_health/test_watchdog.py

fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Failed
- hook id: flake8
- exit code: 1

tests/system_health/test_watchdog.py:10:10: F821 undefined name 'logging'
tests/system_health/test_watchdog.py:14:1: E302 expected 2 blank lines, found 1
tests/system_health/test_watchdog.py:17:121: E501 line too long (122 > 120 characters)
tests/system_health/test_watchdog.py:34:120: E203 whitespace before ','
tests/system_health/test_watchdog.py:45:121: E501 line too long (126 > 120 characters)
...
[truncated extra lines, please run pre-commit locally to view full check results]

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

@mssonicbld
Copy link
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Failed
- hook id: trailing-whitespace
- exit code: 1
- files were modified by this hook

Fixing tests/system_health/test_watchdog.py

fix end of files.........................................................Passed
check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Passed
check conditional mark sort..............................................Passed

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

@mssonicbld
Copy link
Collaborator

The pre-commit check detected issues in the files touched by this pull request.
The pre-commit check is a mandatory check, please fix detected issues.

Detailed pre-commit check results:
trim trailing whitespace.................................................Passed
fix end of files.........................................................Failed
- hook id: end-of-file-fixer
- exit code: 1
- files were modified by this hook

Fixing tests/system_health/test_watchdog.py

check yaml...............................................................Passed
check for added large files..............................................Passed
check python ast.........................................................Passed
flake8...................................................................Passed
check conditional mark sort..............................................Passed

To run the pre-commit checks locally, you can follow below steps:

  1. Ensure that default python is python3. In sonic-mgmt docker container, default python is python2. You can run
    the check by activating the python3 virtual environment in sonic-mgmt docker container or outside of sonic-mgmt
    docker container.
  2. Ensure that the pre-commit package is installed:
sudo pip install pre-commit
  1. Go to repository root folder
  2. Install the pre-commit hooks:
pre-commit install
  1. Use pre-commit to check staged file:
pre-commit
  1. Alternatively, you can check committed files using:
pre-commit run --from-ref <commit_id> --to-ref <commit_id>

@liuh-80 liuh-80 marked this pull request as ready for review May 15, 2023 02:41
@liuh-80 liuh-80 requested a review from prgeor as a code owner May 15, 2023 02:41
@liuh-80 liuh-80 requested a review from qiluo-msft May 15, 2023 02:41
duthost = duthosts[enum_rand_one_per_hwsku_hostname]

result = duthost.shell(
r"docker exec -i swss sh -c 'test -f /usr/bin/supervisor-proc-watchdog-listener && echo exist'",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liuh-80 can you point me to the watchdog listner code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prgeor here is the listener: sonic-net/sonic-buildimage#14686
I will update the PR later to merge the listener with process exit listener.

liuh-80 added a commit to sonic-net/sonic-swss that referenced this pull request Jun 6, 2023
**What I did**
Improve orch agent: output heartbeat message to systemd.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.
Manually validate the heartbeat message works correctly.

**Details if related**
Another inprogress PR will add watchdog for this heartbeat message:
sonic-net/sonic-buildimage#14686

sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306
qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 6, 2023
…ave issue. (#14686)

This PR depends on sonic-net/sonic-swss#2737 merge first.

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.
Add new UT sonic-net/sonic-mgmt#8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
pytest.skip("Skip orchagent watchdog test.")

# wait watchdog emit alert
WATCHDOG_TIMEOUT = 120
Copy link
Contributor

@qiluo-msft qiluo-msft Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

120

Magic number. How do you get the value? Is it too long or possible too short if anyone change code accidently? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The watchdog will send alert after 60 seconds, so I wait for 60*2 here.

def pause_orchagent(duthost):
# find orchagent pid
pid = duthost.shell(
r"ps -ef | grep orchagent | grep -v grep | awk '{print $2}'",
Copy link
Contributor

@qiluo-msft qiluo-msft Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps -ef | grep orchagent | grep

Not a blocking issue, you may consider pgrep. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 13, 2023
…ave issue. (#15429)

Add watchdog mechanism to swss service and generate alert when swss have issue. 

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.

Manually test process_monitoring/test_critical_process_monitoring.py can pass.

Add new UT sonic-net/sonic-mgmt#8306 to check watchdog works correctly.

Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
@liuh-80
Copy link
Contributor Author

liuh-80 commented Jun 13, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Copy link
Contributor Author

liuh-80 commented Jun 14, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@liuh-80
Copy link
Contributor Author

liuh-80 commented Jun 14, 2023

system_health/test_watchdog.py::test_orchagent_watchdog[vlab-01] PASSED [100%]

Will merge PR.

@liuh-80 liuh-80 merged commit 16ca5ad into sonic-net:master Jun 14, 2023
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 20, 2023
**What I did**
Improve orch agent: output heartbeat message to systemd.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.
Manually validate the heartbeat message works correctly.

**Details if related**
Another inprogress PR will add watchdog for this heartbeat message:
sonic-net/sonic-buildimage#14686

sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#14686)

This PR depends on sonic-net/sonic-swss#2737 merge first.

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.
Add new UT sonic-net/sonic-mgmt#8306 to check watchdog works correctly.
Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#15429)

Add watchdog mechanism to swss service and generate alert when swss have issue. 

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add orchagent watchdog to monitor and alert orchagent stuck issue.

**Why I did it**
Currently SONiC monit system only monit orchagent process exist or not. If orchagent process stuck and stop processing, current monit can't find and report it.

**How I verified it**
Pass all UT.

Manually test process_monitoring/test_critical_process_monitoring.py can pass.

Add new UT sonic-net/sonic-mgmt#8306 to check watchdog works correctly.

Manually test, after pause orchagent with 'kill -STOP <pid>', check there are warning message exist in log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
mrkcmo pushed a commit to Azarack/sonic-mgmt that referenced this pull request Oct 3, 2023
### Description of PR
Add UT for orchagent watchdog.

Summary:
SWSS service will add watchdog mechanism to generate keepalive message, and generate alert when swss have issue.
This PR will add new UT to cover the watchdog mechanism.

### Type of change

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [x] Test case(new/improvement)


### Back port request
- [ ] 201911
- [ ] 202012
- [ ] 202205

### Approach
#### What is the motivation for this PR?
Add new UT to test and protect watchdog mechanism from code change.

#### How did you do it?
Pause orchagent service with 'kill -stop' command and check if the watchdog can send alert.

#### How did you verify/test it?
Manually test new UT.
Pass PR validation.

#### Any platform specific information?
No

#### Supported testbed topology if it's a new test case?
Any

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Jan 25, 2024
### Description of PR
Add UT for orchagent watchdog.

Summary:
SWSS service will add watchdog mechanism to generate keepalive message, and generate alert when swss have issue.
This PR will add new UT to cover the watchdog mechanism.

### Type of change

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [x] Test case(new/improvement)


### Back port request
- [ ] 201911
- [ ] 202012
- [ ] 202205

### Approach
#### What is the motivation for this PR?
Add new UT to test and protect watchdog mechanism from code change.

#### How did you do it?
Pause orchagent service with 'kill -stop' command and check if the watchdog can send alert.

#### How did you verify/test it?
Manually test new UT.
Pass PR validation.

#### Any platform specific information?
No

#### Supported testbed topology if it's a new test case?
Any

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants