
[SNMP-SUBAGENT] [202012] Observed an unusual spike in CPU usage because of blocking=True argument #8202

Closed
vivekrnv opened this issue Jul 16, 2021 · 2 comments · Fixed by sonic-net/sonic-swss-common#505
Labels
Triaged (this issue has been triaged)

Comments

@vivekrnv
Contributor

Description

Steps to reproduce the issue:

  1. Load any 202012 image which was built with this change PR
  2. Disable counterpoll, if not already disabled, and reload the switch
  3. Wait for the SNMP Container to be up.

Describe the results you received:

Check the CPU usage of the SNMP container (the docker stats output below) and the snmp-subagent process:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
24b286433cb7        mgmt-framework      0.01%               108.2MiB / 7.689GiB   1.37%               0B / 0B             26.5MB / 69.6kB     19
78e05cfe9404        snmp                100.66%             68.76MiB / 7.689GiB   0.87%               0B / 0B             0B / 139kB          9
f88c433f5867        telemetry           0.89%               141.4MiB / 7.689GiB   1.80%               0B / 0B             36MB / 86kB         34
36437a3448cd        dhcp_relay          0.03%               30.47MiB / 7.689GiB   0.39%               0B / 0B             795kB / 73.7kB      6
ea34908916a6        radv                0.03%               29.96MiB / 7.689GiB   0.38%               0B / 0B             561kB / 77.8kB      6
dcdd0f45efb8        pmon                0.94%               212MiB / 7.689GiB     2.69%               0B / 0B             72.4MB / 1.21MB     16
8fa306e016cc        lldp                6.52%               59.86MiB / 7.689GiB   0.76%               0B / 0B             7.14MB / 127kB      11
e010a19edc0b        syncd               4.44%               813.2MiB / 7.689GiB   10.33%              0B / 0B             34.9MB / 69.6kB     41
6fc14c869be8        teamd               0.04%               33MiB / 7.689GiB      0.42%               0B / 0B             5.78MB / 90.1kB     12
fedda16a902e        swss                0.13%               54.29MiB / 7.689GiB   0.69%               0B / 0B             17.4MB / 246kB      37
84aaea73fd3f        bgp                 0.05%               71.48MiB / 7.689GiB   0.91%               0B / 0B             16.8MB / 5.83MB     22
29fbbe40ea00        database            5.78%               93.08MiB / 7.689GiB   1.18%               0B / 0B             40.8MB / 98.3kB     12

admin@sonic:~$ top
top - 08:06:58 up  4:47,  2 users,  load average: 2.00, 1.67, 1.13
Tasks: 306 total,   2 running, 302 sleeping,   0 stopped,   2 zombie
%Cpu(s): 15.9 us,  2.4 sy,  0.0 ni, 81.5 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :   7873.9 total,   2911.0 free,   3284.7 used,   1678.2 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   4232.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  74572 root      20   0  153692  57784  13900 R 100.0   0.7  12:56.35 python3
   7819 root      20   0 1330816 822260  46216 S  24.2  10.2  49:31.42 sx_sdk

admin@sonic:~$ ps -aux | grep 74572
root       74572 99.1  0.7 153692 57784 pts/0    Rl   07:53  13:29 python3 -m sonic_ax_impl
admin      75908  0.0  0.0   6692   828 pts/1    S+   08:07   0:00 grep 74572

After doing some profiling:

GIL: 0.00%, Active: 100.00%, Threads: 2

  %Own   %Total  OwnTime  TotalTime  Function (filename:line)
100.00% 100.00%   319.0s    319.1s   get_all (swsscommon/swsscommon.py:1661)
  0.00%   0.00%    2.89s     2.89s   update_rem_if_mgmt (sonic_ax_impl/mibs/ieee802_1ab.py:542)
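The profile above matches the format of py-spy's top view. Assuming py-spy is what produced it, a comparable trace can be captured against the subagent PID with:

admin@sonic:~$ sudo py-spy top --pid 74572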

It is clear that the get_all method of swsscommon.SonicV2Connector takes a significant amount of time when called with blocking set to True (it waits for around 60 seconds if the data is not present in Redis). It also appears to busy-wait, as can be inferred from the logic here LINK, which is the reason for the CPU usage spike.
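For illustration, the problematic call pattern looks roughly like this (a minimal sketch, not the actual sonic_ax_impl code; the connector arguments and the OID key are placeholders):

from swsscommon.swsscommon import SonicV2Connector

db = SonicV2Connector(use_unix_socket_path=True)  # illustrative setup
db.connect("COUNTERS_DB")

# With blocking=True, get_all() keeps retrying for up to ~60 s until the
# hash appears in Redis; per the profile above, that retry loop spins
# instead of sleeping, pinning a CPU core while the key is absent.
counters = db.get_all("COUNTERS_DB", "COUNTERS:oid:0x1000000000002",
                      blocking=True)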

This behavior is seen particularly after that commit because it essentially delays the initialization of FLEX_COUNTERS, and thereby of the COUNTERS:oid* keys, until counterpoll is enabled.

But the MIB classes in the snmp-subagent periodically poll these DBs with blocking set to True.

As a fix, we can set blocking to False in all these MIB classes. This shouldn't affect how they operate: they poll Redis periodically anyway, so data missed on one cycle is fetched within the next 3 to 7 seconds (5 seconds is the default period, with a random jitter of -2 to +2 seconds), as sketched below.
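A minimal sketch of the proposed change (the polling loop and key are illustrative of the updater schedule, not the actual MIB updater code):

import random
import time

from swsscommon.swsscommon import SonicV2Connector

db = SonicV2Connector(use_unix_socket_path=True)  # illustrative setup
db.connect("COUNTERS_DB")

while True:
    # blocking=False returns immediately (empty result if the key is
    # absent) instead of spinning for up to 60 s waiting for it.
    counters = db.get_all("COUNTERS_DB", "COUNTERS:oid:0x1000000000002",
                          blocking=False)
    if counters:
        pass  # update the MIB entries from the fetched counters
    # Default 5 s period with a random +/-2 s jitter, per the schedule above.
    time.sleep(5 + random.uniform(-2, 2))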

Describe the results you expected:

Shouldn't see the CPU usage spike.

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here)

Additional information you deem important (e.g. issue happens only occasionally):

@qiluo-msft
Collaborator

How long does the "CPU usage spike" last, starting from process invocation? If the timeframe is reasonable, does snmp-subagent work as well after that as before?

@zhangyanzhao
Collaborator

@qiluo-msft is checking

@zhangyanzhao added the Triaged label on Jul 21, 2021
qiluo-msft added a commit to sonic-net/sonic-swss-common that referenced this issue Jul 22, 2021
Fixes sonic-net/sonic-buildimage#8202
- Fix pubsub channel timeout unit
- Reset context err before next redisGetReply, otherwise later redisGetReply early return on no data, and burn CPU core in a loop.
qiluo-msft added a commit to sonic-net/sonic-swss-common that referenced this issue Jul 28, 2021
Fixes sonic-net/sonic-buildimage#8202
- Fix pubsub channel timeout unit
- Reset context err before next redisGetReply, otherwise later redisGetReply early return on no data, and burn CPU core in a loop.
judyjoseph pushed a commit to sonic-net/sonic-swss-common that referenced this issue Aug 17, 2021
Fixes sonic-net/sonic-buildimage#8202
- Fix pubsub channel timeout unit
- Reset context err before next redisGetReply, otherwise later redisGetReply early return on no data, and burn CPU core in a loop.