Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stale Counter db table exists even after rif is deleted. #2193

Open
sumanbrcm opened this issue Mar 22, 2022 · 1 comment
Open

Stale Counter db table exists even after rif is deleted. #2193

sumanbrcm opened this issue Mar 22, 2022 · 1 comment

Comments

@sumanbrcm
Copy link
Contributor

sumanbrcm commented Mar 22, 2022

Description
SONiC maintains rif counter statistics in COUNTER_DB with the the help of 4 tables. They are needed for displaying packets/bytes as well as rates in tx/tx direction. These tables are accessed by object-id(OID) for the corresponding RIF. The RIF - OID is maintained in another counter-db table COUNTERS_RIF_NAME_MAP. But once RIF is deleted, although the mapping is deleted , but the 4 tables do not get deleted and remain as stale.

Steps to reproduce the issue:

  1. Create Vlan and make it a member of any port
  2. Assign ip to this vlan
  3. Check counter db and check the oid for this in "COUNTERS_RIF_NAME_MAP"
  4. Check the tables which have got created for the oid.
  5. Delete the ip from vlan (rif gets deleted)
  6. Check the counter db , the tables still exists for those oid.

Describe the results you received:
Upon execution of the steps , after RIF was deleted the stale table still existed in COUNTERS DB. For clarity the exact CLI is also shown as below . After deletion of RIF, the stale table exists which is shown in step 6 .

  1. sudo config vlan add 100
    sudo config vlan member add 100 Ethernet48
  2. sudo config interface ip add Vlan100 100.100.100.1/24
  3. redis-cli -n 2
    HGETALL "COUNTERS_RIF_NAME_MAP"
    127) "Vlan100"
    128) "oid:0x6000000000ba6"
  4. 127.0.0.1:6379[2]> keys *6000000000ba6*
    1) "COUNTERS:oid:0x6000000000ba6"
    2) "RATES:oid:0x6000000000ba6"
    3) "RATES:oid:0x6000000000ba6:RIF"
    127.0.0.1:6379[2]>
  5. sudo config interface ip remove Vlan100 100.100.100.1/24
  6. 127.0.0.1:6379[2]> keys *6000000000ba6*
    1) "COUNTERS:oid:0x6000000000ba6"
    2) "RATES:oid:0x6000000000ba6"
    3) "RATES:oid:0x6000000000ba6:RIF"

Describe the results you expected:
The expectation is there should not be any stale table for any deleted/non-existing RIF.

Output of show version:
admin@sonic:~$ show version

SONiC Software Version: SONiC.master.0-9ac289a88
Distribution: Debian 11.2
Kernel: 5.10.0-8-2-amd64
Build commit: 9ac289a88
Build date: Tue Mar 22 11:08:10 UTC 2022
Built by: sk408115@sonic-lvn-csg-004

Platform: x86_64-accton_as7816_64x-r0
HwSKU: Accton-AS7816-64X
ASIC: broadcom
ASIC Count: 1
Serial Number: N/A
Model Number: N/A
Hardware Revision: N/A
Uptime: 18:49:58 up 33 min, 1 user, load average: 0.15, 0.61, 0.70

Docker images:
REPOSITORY TAG IMAGE ID SIZE
docker-orchagent latest da83f8d4ea98 462MB
docker-orchagent master.0-9ac289a88 da83f8d4ea98 462MB
docker-fpm-frr latest 3ee865fd5b1c 460MB
docker-fpm-frr master.0-9ac289a88 3ee865fd5b1c 460MB
docker-nat latest a69d62fac6ba 445MB
docker-nat master.0-9ac289a88 a69d62fac6ba 445MB
docker-macsec latest 5a6aa61a8668 445MB
docker-macsec master.0-9ac289a88 5a6aa61a8668 445MB
docker-sflow latest 8bf63e3ef39a 443MB
docker-sflow master.0-9ac289a88 8bf63e3ef39a 443MB
docker-teamd latest 1ee6c94c589b 442MB
docker-teamd master.0-9ac289a88 1ee6c94c589b 442MB
docker-platform-monitor latest 5effe289b674 541MB
docker-platform-monitor master.0-9ac289a88 5effe289b674 541MB
docker-snmp latest a6442486551c 471MB
docker-snmp master.0-9ac289a88 a6442486551c 471MB
docker-mux latest 530e7ff28531 480MB
docker-mux master.0-9ac289a88 530e7ff28531 480MB
docker-dhcp-relay latest 3af842692b56 449MB
docker-sonic-telemetry latest 10893f4fc878 528MB
docker-sonic-telemetry master.0-9ac289a88 10893f4fc878 528MB
docker-database latest 2af56fc5bcab 439MB
docker-database master.0-9ac289a88 2af56fc5bcab 439MB
docker-syncd-brcm latest 15125617b736 805MB
docker-syncd-brcm master.0-9ac289a88 15125617b736 805MB
docker-gbsyncd-credo latest f2fb6158c6ba 443MB
docker-gbsyncd-credo master.0-9ac289a88 f2fb6158c6ba 443MB
docker-sonic-mgmt-framework latest 2e24cd014e00 582MB
docker-sonic-mgmt-framework master.0-9ac289a88 2e24cd014e00 582MB
docker-lldp latest 4aab8886498e 467MB
docker-lldp master.0-9ac289a88 4aab8886498e 467MB
docker-router-advertiser latest f4666d5d47ed 427MB
docker-router-advertiser master.0-9ac289a88 f4666d5d47ed 427MB

admin@sonic:~$

Output of show techsupport:

@sumanbrcm
Copy link
Contributor Author

I am submitting a PR for this issue

sumanbrcm added a commit to sumanbrcm/sonic-swss that referenced this issue Mar 25, 2022
sumanbrcm added a commit to sumanbrcm/sonic-swss that referenced this issue Apr 1, 2022
sumanbrcm added a commit to sumanbrcm/sonic-swss that referenced this issue Apr 1, 2022
sumanbrcm added a commit to sumanbrcm/sonic-swss that referenced this issue Apr 1, 2022
prsunny pushed a commit that referenced this issue May 12, 2022
To handle this , whenever RIF is deleted (removeRouterIntfs) , we ensure that these COUNTER_DB tables are deleted(handled in cleanUpRifFromCounterDb) .
preetham-singh pushed a commit to preetham-singh/sonic-swss that referenced this issue Aug 6, 2022
To handle this , whenever RIF is deleted (removeRouterIntfs) , we ensure that these COUNTER_DB tables are deleted(handled in cleanUpRifFromCounterDb) .
sumanbrcm added a commit to sumanbrcm/sonic-swss that referenced this issue Oct 13, 2022
* Fixing issue #11621
* The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue sonic-net#2193 the cleanup for stale rif counters was added.
* But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
* So as a fix all such cleanup has been moved to syncd.

Signed-off-by: suman.kumar@broadcom.com
sumanbrcm added a commit to sumanbrcm/sonic-swss that referenced this issue Oct 13, 2022
* Fixing issue #11621
* The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue sonic-net#2193 the cleanup for stale rif counters was added.
* But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
* So as a fix all such cleanup has been moved to syncd.

Signed-off-by: Suman Kumar <suman.kumar@broadcom.com>
liat-grozovik pushed a commit that referenced this issue Nov 8, 2022
The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added.
But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
So as a fix all such cleanup has been moved to syncd.

- What I did
Fix for sonic-net/sonic-buildimage#11621

As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error.
To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition.

- Why I did it

The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db, the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit.

- How I verified it

Followed the steps in (sonic-net/sonic-buildimage#11621) :

Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB
Remove RIF
Repeat the steps multiple times and check if any error syslog is seen (No error syslog is seen)
Also checked cleanup for rif counters.

After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP"
127) "Vlan100"
128) "oid:0x6000000000aa5"

Checked all the tabled in COUNTER_DB which has same OID in keys
127.0.0.1:6379[2]> keys 6000000000aa5
1) "RATES:oid:0x6000000000aa5:RIF"
2) "COUNTERS:oid:0x6000000000aa5"
3) "RATES:oid:0x6000000000aa5"
127.0.0.1:6379[2]>

Deleted the RIF by removing the ip on the intf.

Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now.
127.0.0.1:6379[2]> keys 6000000000aa5
(empty array)
127.0.0.1:6379[2]>

Signed-off-by: Suman Kumar <suman.kumar@broadcom.com>
yxieca pushed a commit that referenced this issue Nov 10, 2022
The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue #2193 the cleanup for stale rif counters was added.
But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
So as a fix all such cleanup has been moved to syncd.

- What I did
Fix for sonic-net/sonic-buildimage#11621

As a past fix which aimed at removing stale rif counters (#2199) , there is a chance of race condition and it leads to lua script reporting error.
To handle this , the rif counters cleanup code(handled in cleanUpRifFromCounterDb) is now called from syncd ( removeCounter ) to avoid such race condition.

- Why I did it

The operations in Orchagent and syncd is not synchronous, so while Orchagent deletes the rif counters from Counters Db, the syncd could still access it. In race conditions the lua script trying to fetch rif counters will have errors syslog for such access as it was already deleted by orchagent. The cleanup code is removed from orchagent is added in syncd - it will make sure no such race condition would get hit.

- How I verified it

Followed the steps in (sonic-net/sonic-buildimage#11621) :

Create RIF in SONiC, wait till RIF rates are populated in COUNTERS DB
Remove RIF
Repeat the steps multiple times and check if any error syslog is seen (No error syslog is seen)
Also checked cleanup for rif counters.

After RIF creation derived info of oid for RIF from "COUNTERS_RIF_NAME_MAP"
127) "Vlan100"
128) "oid:0x6000000000aa5"

Checked all the tabled in COUNTER_DB which has same OID in keys
127.0.0.1:6379[2]> keys 6000000000aa5
1) "RATES:oid:0x6000000000aa5:RIF"
2) "COUNTERS:oid:0x6000000000aa5"
3) "RATES:oid:0x6000000000aa5"
127.0.0.1:6379[2]>

Deleted the RIF by removing the ip on the intf.

Checked COUNTER_DB again with same OID if there are stale entries or not. No stale entries exist now.
127.0.0.1:6379[2]> keys 6000000000aa5
(empty array)
127.0.0.1:6379[2]>

Signed-off-by: Suman Kumar <suman.kumar@broadcom.com>
oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this issue Mar 1, 2023
* Fixing issue #11621
* The cleanup code for stale rif counters are now moved to syncd . Earlier as part of fix for issue sonic-net#2193 the cleanup for stale rif counters was added.
* But it could create a race condition between orchagent removes RIF rate counters from DB and lua script fetching them.
* So as a fix all such cleanup has been moved to syncd.

Signed-off-by: Suman Kumar <suman.kumar@broadcom.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant