Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[route_check.py] - update includes checks on subscriptions #1344

Merged
merged 6 commits into from
Jan 11, 2021

Conversation

renukamanavalan
Copy link
Contributor

@renukamanavalan renukamanavalan commented Dec 31, 2020

- What I did
The routes flow from APPL-DB to ASIC-DB, via orchagent.
This tool's job is to verify that all routes added to APPL-DB do
get into ASIC-DB and all routes removed from APPL-DB are deleted from ASIC-DB.

- How I did it
NOTE: The flow from APPL-DB to ASIC-DB takes non zero milliseconds.
1) Initiate subscribe for ASIC-DB updates.
2) Read APPL-DB & ASIC-DB
3) Get the diff.
4) If any diff,
4.1) Collect subscribe messages for a second
4.2) check diff against the subscribe messages
5) Rule out local interfaces & default routes
6) If still outstanding diffs, report failure.

- How to verify it
Run this tool in SONiC switch and watch the result. In case of failure
checkout the result to validate the failure.
To simulate failure:
Stop Orchagent.
Run this tool, and likely you would see some failures.
You could potentially remove / add routes in APPL / ASIC DBs with orchagent
down to ensure failure.
Analyze the reported failures to match expected.
You may use the exit code to verify the result as success or not.

…commodate

the possible latency between APPL & ASIC DB.
Switched to swsscommon & python3
Added unit tests
@renukamanavalan
Copy link
Contributor Author

With the following diff, this can be cherry picked onto 201911.

remanava@BN6-0101-0123-15T1:~/try$ zdiff rt.py route_check.py 
1c1
< #!/usr/bin/env python
---
> #!/usr/bin/env python3
80c80
<     t = ipaddress.ip_address(ip.split("/")[0].decode('utf-8'))
---
>     t = ipaddress.ip_address(ip.split("/")[0])
85c85
<     t = ipaddress.ip_address(ip.split("/")[0].decode('utf-8'))
---
>     t = ipaddress.ip_address(ip.split("/")[0])
remanava@BN6-0101-0123-15T1:~/try$ 

@renukamanavalan
Copy link
Contributor Author

retest this please

@liat-grozovik
Copy link
Collaborator

liat-grozovik commented Jan 4, 2021

@renukamanavalan

  1. Can you please update the PR description according to the template?
    You may check [PFCWD] Fix 'start' pfcwd command #1345 as a reference.

  2. Could you please add more information per function of what it intends to do and in case of inputs/output returns add what to expect?

  3. is there any HLD of this route check? if not, can you please add a description of this route check and how it is to be used?
    I believe there is a requirement this script intend to meet and it will be much more efficient to document it in the code itself if no HLD can be referenced. As there will be no CLI (if I understand it correct) this will not be documented in the command reference guide.

@liat-grozovik
Copy link
Collaborator

Please note that many of the PRs under sonic-utilities are failing on the 'default' and it seems to be general issue with the build system.

@renukamanavalan renukamanavalan changed the title Route check -- update [route_check.py] - update includes checks on subscriptions Jan 5, 2021
2) Read APPL-DB & ASIC-DB
3) Get the diff.
4) If any diff,
4.1) Collect subscribe messages for a second
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should you verify it persist for 3 consecutive reads instead of 1 read?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is another scheme. 3 consecutive reads and look for diff that is consistently same across all to imply that as possible failures. But this has a hole if a a route was getting removed & added quickly multiple times, then it could be a false positive.

Here we look for SET/DEL in subscription messages to see, if a missed ASIC entry matches with a SET subscribe message, which would have added to ASIC. Similarly look for DEL message for an unaccounted ASIC entry as that would have removed it.

scripts/route_check.py Show resolved Hide resolved
No logical code change.
@renukamanavalan
Copy link
Contributor Author

retest this please

1 similar comment
@liushilongbuaa
Copy link
Contributor

retest this please

prsunny
prsunny previously approved these changes Jan 8, 2021
Copy link
Contributor

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.. minor comments..

@@ -45,6 +94,12 @@ def set_level(lvl, log_to_syslog):


def print_message(lvl, *args):
"""
print and log the message for give level.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given?

print_message(syslog.LOG_ERR, "Failed. Look at reported mismatches above")
return -1
print_message(syslog.LOG_ERR, "add: {", json.dumps(adds, indent=4), "}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking, do we need this add/del json dumps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When failure is reported, the log of add/del during that one sec period, will be very useful for looking in-depth.

@renukamanavalan renukamanavalan merged commit b18ef5a into sonic-net:master Jan 11, 2021
@renukamanavalan renukamanavalan deleted the route_check branch January 11, 2021 19:46
lguohan pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jan 20, 2021
- [route_check.py] - update includes checks on subscriptions (sonic-net/sonic-utilities#1344)
- Validations checks while adding a member to PortChannel and removing a member from a Portchannel (sonic-net/sonic-utilities#1328)
- [show] Add subcommand to show midplane status for modular chassis (sonic-net/sonic-utilities#1267)
- [pytest][qos][config] Added pytests for "config qos reload" commands" (sonic-net/sonic-utilities#1346)
- Drop explict 3 seconds pause between two object updates/deletes. (sonic-net/sonic-utilities#1359)
- [show]fix for show muxcable status by replacing "hostname" to "peer_switch" for deriving tor ipv4_address (sonic-net/sonic-utilities#1360)
- [PFCWD] Fix 'start' pfcwd command (sonic-net/sonic-utilities#1345)

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
lguohan pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jan 21, 2021
- [route_check.py] - update includes checks on subscriptions (sonic-net/sonic-utilities#1344)
- Validations checks while adding a member to PortChannel and removing a member from a Portchannel (sonic-net/sonic-utilities#1328)
- [show] Add subcommand to show midplane status for modular chassis (sonic-net/sonic-utilities#1267)
- [pytest][qos][config] Added pytests for "config qos reload" commands" (sonic-net/sonic-utilities#1346)
- Drop explict 3 seconds pause between two object updates/deletes. (sonic-net/sonic-utilities#1359)
- [show]fix for show muxcable status by replacing "hostname" to "peer_switch" for deriving tor ipv4_address (sonic-net/sonic-utilities#1360)
- [PFCWD] Fix 'start' pfcwd command (sonic-net/sonic-utilities#1345)

Signed-off-by: Shlomi Bitton <shlomibi@nvidia.com>
@abdosi
Copy link
Contributor

abdosi commented Jan 28, 2021

@renukamanavalan Please create PR for 201911. Cherry-pick has conflcit.

@rlhui

@abdosi
Copy link
Contributor

abdosi commented Feb 12, 2021

@renukamanavalan

@renukamanavalan Please create PR for 201911. Cherry-pick has conflcit.

@rlhui

@renukamanavalan please create pr for 201911

renukamanavalan added a commit to renukamanavalan/sonic-utilities that referenced this pull request Feb 12, 2021
…#1344)

Summary: Improve the tool to handle the possible latency between APPL-DB & ASIC-DB by looking at subscription messages.

- What I did
The routes flow from APPL-DB to ASIC-DB, via orchagent.
This tool's job is to verify that all routes added to APPL-DB do
get into ASIC-DB and all routes removed from APPL-DB are deleted from ASIC-DB.

- How I did it
NOTE: The flow from APPL-DB to ASIC-DB takes non zero milliseconds.
1) Initiate subscribe for ASIC-DB updates.
2) Read APPL-DB & ASIC-DB
3) Get the diff.
4) If any diff,
4.1) Collect subscribe messages for a second
4.2) check diff against the subscribe messages
5) Rule out local interfaces & default routes
6) If still outstanding diffs, report failure.

- How to verify it
Run this tool in SONiC switch and watch the result. In case of failure
checkout the result to validate the failure.
To simulate failure:
Stop Orchagent.
Run this tool, and likely you would see some failures.
You could potentially remove / add routes in APPL / ASIC DBs with orchagent
down to ensure failure.
Analyze the reported failures to match expected.
You may use the exit code to verify the result as success or not.
anand-kumar-subramanian pushed a commit to anand-kumar-subramanian/sonic-utilities that referenced this pull request Mar 2, 2021
…#1344)

Summary: Improve the tool to handle the possible latency between APPL-DB & ASIC-DB by looking at subscription messages.

- What I did
The routes flow from APPL-DB to ASIC-DB, via orchagent.
This tool's job is to verify that all routes added to APPL-DB do
get into ASIC-DB and all routes removed from APPL-DB are deleted from ASIC-DB.

- How I did it
NOTE: The flow from APPL-DB to ASIC-DB takes non zero milliseconds.
1) Initiate subscribe for ASIC-DB updates.
2) Read APPL-DB & ASIC-DB
3) Get the diff.
4) If any diff,
4.1) Collect subscribe messages for a second
4.2) check diff against the subscribe messages
5) Rule out local interfaces & default routes
6) If still outstanding diffs, report failure.

- How to verify it
Run this tool in SONiC switch and watch the result. In case of failure
checkout the result to validate the failure.
To simulate failure:
Stop Orchagent.
Run this tool, and likely you would see some failures.
You could potentially remove / add routes in APPL / ASIC DBs with orchagent
down to ensure failure.
Analyze the reported failures to match expected.
You may use the exit code to verify the result as success or not.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants