Add orchagent heart beat message for watchdog. #2737

Merged
9 commits merged on Jun 6, 2023

Contributor

@liuh-80 liuh-80 commented Apr 17, 2023

What I did
Improve orchagent so that it outputs a heartbeat message to systemd.

Why I did it
Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

How I verified it
All unit tests pass.
Manually validated that the heartbeat message works correctly.

Details if related
Another in-progress PR will add a watchdog for this heartbeat message:
sonic-net/sonic-buildimage#14686

sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306
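
To make the heartbeat idea concrete, here is a minimal sketch of a rate-limited heartbeat emitter. It is not the code from this PR: the class name `HeartbeatEmitter`, the message text `heartbeat`, the use of `std::chrono::steady_clock`, and the interval passed by the caller are all assumptions chosen for illustration. The point it shows is that a heartbeat is only written when the main loop actually reaches the call, so a stuck loop stops producing output even though the process still exists.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <ostream>

// Hypothetical sketch, not the implementation merged in this PR.
using Clock = std::chrono::steady_clock;

class HeartbeatEmitter
{
public:
    HeartbeatEmitter(std::ostream &out, Clock::duration interval, Clock::time_point start)
        : m_out(out), m_interval(interval), m_last(start)
    {
        // A non-positive interval would emit on every call.
        assert(interval > Clock::duration::zero());
    }

    // Call once per main-loop iteration.
    // Returns true when a heartbeat line was written.
    bool tick(Clock::time_point now)
    {
        if (now - m_last < m_interval)
        {
            return false; // too early, stay quiet
        }
        m_last = now;
        m_out << "heartbeat" << std::endl; // placeholder message text
        return true;
    }

private:
    std::ostream &m_out;
    Clock::duration m_interval;
    Clock::time_point m_last; // per-instance state, set in the constructor
};
```

A caller would construct one emitter, for example `HeartbeatEmitter hb(std::cout, std::chrono::seconds(10), Clock::now());`, and invoke `hb.tick(Clock::now())` at the end of each loop iteration. Passing the time point in, rather than reading the clock inside `tick`, keeps the logic easy to test.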

@liuh-80 liuh-80 changed the title from "Add orchagent heart beat message" to "[POC] Add orchagent heart beat message" Apr 17, 2023
@liuh-80 liuh-80 requested a review from qiluo-msft April 28, 2023 10:16
@liuh-80 liuh-80 changed the title from "[POC] Add orchagent heart beat message" to "Add orchagent heart beat message for watchdog." Apr 28, 2023
@liuh-80 liuh-80 marked this pull request as ready for review April 28, 2023 10:18
@liuh-80 liuh-80 requested a review from prsunny as a code owner April 28, 2023 10:18
@@ -958,6 +963,20 @@ void OrchDaemon::addOrchList(Orch *o)
    m_orchList.push_back(o);
}

void OrchDaemon::heartBeat(std::chrono::time_point<std::chrono::high_resolution_clock> tcurrent)
{
    static auto tlast = std::chrono::high_resolution_clock::now();
Contributor

@qiluo-msft qiluo-msft May 19, 2023

static

You are assuming OrchDaemon has only a single instance in the process. To be super safe, you can use a static member variable instead of a static function variable. #Closed

Contributor Author

@liuh-80 liuh-80 May 22, 2023

Fixed, changed to a static member variable.

Contributor

Sorry for the misleading comment.

To be super safe, you can use a non-static member variable instead of a static function variable.

Contributor Author

Fixed, changed to a non-static member.
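
The concern raised in this thread can be illustrated with a small sketch. The two structs below are hypothetical and use a plain counter instead of a timestamp, but the storage rules are the same: a function-local `static` is shared by every instance of the class, while a non-static member belongs to each instance separately.

```cpp
#include <cassert>

// Hypothetical types for illustration only; not part of OrchDaemon.
struct WithLocalStatic
{
    int bump()
    {
        // One counter for the whole process, shared by all instances.
        static int calls = 0;
        return ++calls;
    }
};

struct WithMember
{
    int bump()
    {
        // One counter per instance.
        return ++m_calls;
    }

    int m_calls = 0;
};

void demo()
{
    WithLocalStatic a, b;
    int first = a.bump();
    int second = b.bump();
    // b continues from the call made through a.
    assert(first == 1 && second == 2);

    WithMember c, d;
    int fromC = c.bump();
    int fromD = d.bump();
    // d is unaffected by c.
    assert(fromC == 1 && fromD == 1);
}
```

With a single `OrchDaemon` per process both forms behave the same, which is why the reviewer framed the change as a safety measure rather than a bug fix.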

@@ -64,6 +68,8 @@ event_handle_t g_events_handle;
#define DEFAULT_MAX_BULK_SIZE 1000
size_t gMaxBulkSize = DEFAULT_MAX_BULK_SIZE;

std::chrono::time_point<std::chrono::high_resolution_clock> OrchDaemon::m_lastHeartBeat = std::chrono::high_resolution_clock::now();
Contributor

@qiluo-msft qiluo-msft May 30, 2023

m_lastHeartBeat

std::chrono::time_point<std::chrono::high_resolution_clock> OrchDaemon::m_lastHeartBeat = std::chrono::high_resolution_clock::now();

If not static, should it be initialized in ctor? #Closed

Contributor Author

Fixed, initialized in the constructor.

@liuh-80 liuh-80 merged commit 99a2a26 into sonic-net:master Jun 6, 2023
qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 6, 2023
…ave issue. (#14686)

This PR depends on sonic-net/sonic-swss#2737 being merged first.

**What I did**
Add an orchagent watchdog to monitor for and alert on a stuck orchagent.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
All unit tests pass.
Added a new test, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', checked that the following message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
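
The watchdog side can be sketched in the same spirit. The actual watchdog lives in sonic-buildimage and is not shown in this thread, so everything below is an assumption made for illustration: the class name `Watchdog`, its methods, and the 60-second timeout (picked only to echo the "1.0 minutes" in the log line above). The sketch records the time of the last heartbeat per process and reports any process whose heartbeat is older than the timeout.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch, not the watchdog added in sonic-buildimage.
using Clock = std::chrono::steady_clock;

class Watchdog
{
public:
    explicit Watchdog(Clock::duration timeout) : m_timeout(timeout)
    {
        assert(timeout > Clock::duration::zero());
    }

    // Record that `process` emitted a heartbeat at time `now`.
    void onHeartbeat(const std::string &process, Clock::time_point now)
    {
        m_lastSeen[process] = now;
    }

    // Return the processes whose last heartbeat is at least `timeout` old.
    std::vector<std::string> stuck(Clock::time_point now) const
    {
        std::vector<std::string> result;
        for (const auto &entry : m_lastSeen)
        {
            if (now - entry.second >= m_timeout)
            {
                result.push_back(entry.first);
            }
        }
        return result;
    }

private:
    Clock::duration m_timeout;
    std::map<std::string, Clock::time_point> m_lastSeen;
};
```

A process paused with 'kill -STOP <pid>' still exists, so an existence check passes, but it stops calling `onHeartbeat` and shows up in `stuck()` once the timeout elapses.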
qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 13, 2023
…ave issue. (#15429)

Add a watchdog mechanism to the swss service and generate an alert when swss has an issue.

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Add an orchagent watchdog to monitor for and alert on a stuck orchagent.

**Why I did it**
Currently the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
All unit tests pass.

Manually verified that process_monitoring/test_critical_process_monitoring.py passes.

Added a new test, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.

Manual test: after pausing orchagent with 'kill -STOP <pid>', checked that the following message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 20, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#14686)
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#15429)