Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Orchagent part of Error Monitor feature for code training #1

Closed
wants to merge 2 commits into from

Conversation

Junchao-Mellanox
Copy link
Owner

@Junchao-Mellanox Junchao-Mellanox commented Nov 25, 2019

What I did
Create a new class ErrorMonitorOrch. This class has five major responsibilities:

  1. Subscribe changes in ERROR_MONITOR_THRESHOLD_TABLE. Record TX_ERROR_THRESHOLD and TX_ERROR_STATUS for each port.
  2. Subscribe changes in ERROR_MONITOR_INTERVAL_TABLE. Once POLL_INTERVAL changes, reset the poll timer.
  3. Once timer hit, get TX error counter via SAI, compare it with threshold value and set the TX_STATUS accordingly.
  4. Subscribe channel for the clear event. Once user issues a clear, clear TX_ERROR_NUMBER and TX_STATUS.
  5. Subscribe port change event, once port is created or removed, react properly.

Why I did it
Monitor the tx errors on switch ports, in case the number of errors on one port in a configurable time period reaches the configurable thresholds, port is deemed as “Not OK”, the log with critical severity should be raised and a CLI command should be provide to display the port error status, OK or Not OK.
How I verified it

Details if related

@keboliu
Copy link
Collaborator

keboliu commented Nov 25, 2019

general comments: need to filling the sections "what I did", "why I did it", "how I verified it" and "Detailes if related" if applicable, this will help the reviewer to understand your code.

@@ -54,7 +54,8 @@ orchagent_SOURCES = \
watermarkorch.cpp \
policerorch.cpp \
sfloworch.cpp \
chassisorch.cpp
chassisorch.cpp \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't see this file?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strange. I can see that I added a errormonitororch.cpp in this file locally.

ErrorMonitorContext *context = kv.second;
if (!gPortsOrch->getPort(context->name, port))
{
continue; // should not happen...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should print out some log here?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

portStats.data());
if (status != SAI_STATUS_SUCCESS)
{
SWSS_LOG_ERROR("Failed to get stats of port 0x%" PRIx64 ": %d", port.m_port_id, status);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the log print out port alias will be more helpful to understand which port has the error.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do.

}
catch(const std::invalid_argument &e)
{
SWSS_LOG_ERROR("Invalid tx error threshold value: %s", value.c_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should perform the validation check in CLI? or how can you notify the user that the configuration is invalid?

Copy link
Owner Author

@Junchao-Mellanox Junchao-Mellanox Nov 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already checked in sonic-utilities. Here is only make sure user would not change the redis DB directly.

}
catch(const std::invalid_argument &e)
{
SWSS_LOG_ERROR("Invalid tx error threshold value: %s", value.c_str());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should perform the validation check in CLI? or how can you notify the user that the configuration is invalid?

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Already checked in sonic-utilities. Here is only make sure user would not change the redis DB directly.

Junchao-Mellanox pushed a commit that referenced this pull request Jan 11, 2021
* Fix the stack-overflow issue in AclOrch::getTableById()

In case the table_id and m_mirrorTableId/m_mirrorTableId are empty
The function would be called recursively without ending.

```
If we have some rule that does not have table defined, it will trigger this issue.

(gdb) bt
#0  0x00007f77ac5af3a9 in swss::Logger::write (this=0x7f77ac801ec0 <swss::Logger::getInstance()::m_logger>, prio=swss::Logger::SWSS_DEBUG, fmt=0x7f77ac5f1690 ":> %s: enter")
    at logger.cpp:209
#1  0x00000000004a5792 in AclOrch::getTableById (this=this@entry=0x1c722a0, table_id="") at aclorch.cpp:2617
#2  0x00000000004a59d1 in AclOrch::getTableById (this=this@entry=0x1c722a0, table_id="") at aclorch.cpp:2634
#3  0x00000000004a59d1 in AclOrch::getTableById (this=this@entry=0x1c722a0, table_id="") at aclorch.cpp:2634
...
#20944 0x00000000004a59d1 in AclOrch::getTableById (this=this@entry=0x1c722a0, table_id="") at aclorch.cpp:2634
#20945 0x00000000004ad3ce in AclOrch::doAclRuleTask (this=this@entry=0x1c722a0, consumer=...) at aclorch.cpp:2437
#20946 0x00000000004b04cd in AclOrch::doTask (this=0x1c722a0, consumer=...) at aclorch.cpp:2141
#20947 0x00000000004231b2 in Orch::doTask (this=0x1c722a0) at orch.cpp:369
#20948 0x000000000041c4e9 in OrchDaemon::start (this=this@entry=0x1c19960) at orchdaemon.cpp:376
#20949 0x0000000000409ffc in main (argc=<optimized out>, argv=0x7ffe2e392d68) at main.cpp:295

(gdb) p table_id
$1 = ""
(gdb) p m_mirrorTableId
$2 = ""
(gdb) p m_mirrorV6TableId
$3 = ""
```

Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
@Junchao-Mellanox Junchao-Mellanox deleted the error-monitor branch October 29, 2021 01:45
Junchao-Mellanox pushed a commit that referenced this pull request Jul 1, 2022
Currently, ASAN sometimes reports the BufferOrch::m_buffer_type_maps and QosOrch::m_qos_maps as leaked. However, their lifetime is the lifetime of a process so they are not really 'leaked'.
This also adds a simple way to add more suppressions later if required.

Example of ASAN report:

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7f96aa952d30 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xead30)
    #1 0x55ca1da9f789 in __static_initialization_and_destruction_0 /__w/2/s/orchagent/bufferorch.cpp:39
    #2 0x55ca1daa02af in _GLOBAL__sub_I_bufferorch.cpp /__w/2/s/orchagent/bufferorch.cpp:1321
    #3 0x55ca1e2a9cd4  (/usr/bin/orchagent+0xe89cd4)

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7f96aa952d30 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xead30)
    #1 0x55ca1da6d2da in __static_initialization_and_destruction_0 /__w/2/s/orchagent/qosorch.cpp:80
    #2 0x55ca1da6ecf2 in _GLOBAL__sub_I_qosorch.cpp /__w/2/s/orchagent/qosorch.cpp:2000
    #3 0x55ca1e2a9cd4  (/usr/bin/orchagent+0xe89cd4)

- What I did
Added an lsan suppression config with static variable leak suppression

- Why I did it
To suppress ASAN false positives

- How I verified it
Run a test that produces the static variable leaks report and checked that report has these leaks suppressed.

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Junchao-Mellanox pushed a commit that referenced this pull request Sep 19, 2022
…onic-net#2446)

* Add events publish

* Added header file

* signature fix

* syntax

* syntax

* syntax

* syntax

* syntax

* syntax

* Updated fake code

* Remove if and log messages for event_publish

* Remove if and log messages for event_publish (#1)

* Remove event_handle_t from signature and add globally

* Remove extern orchdaemon.cpp

* Revert unneeded changes

Co-authored-by: zbud-msft <zainbudhwani@microsoft.com>
Co-authored-by: Zain Budhwani <99770260+zbud-msft@users.noreply.github.com>
Junchao-Mellanox pushed a commit that referenced this pull request Mar 7, 2023
Currently, ASAN sometimes reports the BufferOrch::m_buffer_type_maps and QosOrch::m_qos_maps as leaked. However, their lifetime is the lifetime of a process so they are not really 'leaked'.
This also adds a simple way to add more suppressions later if required.

Example of ASAN report:

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7f96aa952d30 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xead30)
    #1 0x55ca1da9f789 in __static_initialization_and_destruction_0 /__w/2/s/orchagent/bufferorch.cpp:39
    #2 0x55ca1daa02af in _GLOBAL__sub_I_bufferorch.cpp /__w/2/s/orchagent/bufferorch.cpp:1321
    #3 0x55ca1e2a9cd4  (/usr/bin/orchagent+0xe89cd4)

Direct leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7f96aa952d30 in operator new(unsigned long) (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xead30)
    #1 0x55ca1da6d2da in __static_initialization_and_destruction_0 /__w/2/s/orchagent/qosorch.cpp:80
    #2 0x55ca1da6ecf2 in _GLOBAL__sub_I_qosorch.cpp /__w/2/s/orchagent/qosorch.cpp:2000
    #3 0x55ca1e2a9cd4  (/usr/bin/orchagent+0xe89cd4)

- What I did
Added an lsan suppression config with static variable leak suppression

- Why I did it
To suppress ASAN false positives

- How I verified it
Run a test that produces the static variable leaks report and checked that report has these leaks suppressed.

Signed-off-by: Yakiv Huryk <yhuryk@nvidia.com>
Junchao-Mellanox pushed a commit that referenced this pull request May 6, 2024
**What I did**

Fix the Mem Leak by moving the raw pointers in type_maps to use smart pointers

**Why I did it**

```
Indirect leak of 83776 byte(s) in 476 object(s) allocated from:
    #0 0x7f0a2a414647 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x5555590cc923 in __gnu_cxx::new_allocator, std::allocator > const, referenced_object> > >::allocate(unsigned long, void const*) /usr/include/c++/10/ext/new_allocator.h:115
    #2 0x5555590cc923 in std::allocator_traits, std::allocator > const, referenced_object> > > >::allocate(std::allocator, std::allocator > const, referenced_object> > >&, unsigned long) /usr/include/c++/10/bits/alloc_traits.h:460
    #3 0x5555590cc923 in std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_get_node() /usr/include/c++/10/bits/stl_tree.h:584
    #4 0x5555590cc923 in std::_Rb_tree_node, std::allocator > const, referenced_object> >* std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_create_node, std::allocator > const&>, std::tuple<> >(std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:634
    #5 0x5555590cc923 in std::_Rb_tree_iterator, std::allocator > const, referenced_object> > std::_Rb_tree, std::allocator >, std::pair, std::allocator > const, referenced_object>, std::_Select1st, std::allocator > const, referenced_object> >, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::_M_emplace_hint_unique, std::allocator > const&>, std::tuple<> >(std::_Rb_tree_const_iterator, std::allocator > const, referenced_object> >, std::piecewise_construct_t const&, std::tuple, std::allocator > const&>&&, std::tuple<>&&) /usr/include/c++/10/bits/stl_tree.h:2461
    #6 0x5555590e8757 in std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >::operator[](std::__cxx11::basic_string, std::allocator > const&) /usr/include/c++/10/bits/stl_map.h:501
    #7 0x5555590d48b0 in Orch::setObjectReference(std::map, std::allocator >, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*, std::less, std::allocator > >, std::allocator, std::allocator > const, std::map, std::allocator >, referenced_object, std::less, std::allocator > >, std::allocator, std::allocator > const, referenced_object> > >*> > >&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&, std::__cxx11::basic_string, std::allocator > const&) orchagent/orch.cpp:450
    #8 0x5555594ff66b in QosOrch::handleQueueTable(Consumer&, std::tuple, std::allocator >, std::__cxx11::basic_string, std::allocator >, std::vector, std::allocator >, std::__cxx11::basic_string, std::allocator > >, std::allocator, std::allocator >, std::__cxx11::basic_string, std::allocator > > > > >&) orchagent/qosorch.cpp:1763
    #9 0x5555594edbd6 in QosOrch::doTask(Consumer&) orchagent/qosorch.cpp:2179
    #10 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:241
    #11 0x5555590c8743 in Consumer::drain() orchagent/orch.cpp:238
    #12 0x5555590c8743 in Consumer::execute() orchagent/orch.cpp:235
    #13 0x555559090dad in OrchDaemon::start() orchagent/orchdaemon.cpp:755
    #14 0x555558e9be25 in main orchagent/main.cpp:766
    #15 0x7f0a299b6d09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants