Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[doc][static-route] Static Route BFD HLD document #1216

Merged
merged 54 commits into from
May 18, 2023

Conversation

baorliu
Copy link
Contributor

@baorliu baorliu commented Jan 14, 2023

1, BfdRouteMgr design to monitor static route nexthop reachability and update static route based on BFD session state.
2, testcases

Repo PR title State
sonic-buildimage[Master] staticroutebfd preocess implementation  GitHub issue/pull request detail
sonic-buildimage[202205] staticroutebfd preocess implementation  GitHub issue/pull request detail
sonic-swss[master] traffic convergence acceleration for staticroutebfd  GitHub issue/pull request detail
sonic-swss[202205] traffic convergence acceleration for staticroutebfd  GitHub issue/pull request detail

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jan 14, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

Follow the StaticRouteTimer style, rename the component to StaticRouteBfd.
StaticRouteBfd does NOT use bgpcfgd/manager framework
change the "refreshable" field name to "expiry", because StaticRouteTimer already implemented the required feature with field name "expiry"
@zhangyanzhao
Copy link
Collaborator

@baorliu can you please help to add the code PRs by referring to #806? Thanks.

@lguohan
Copy link
Contributor

lguohan commented Mar 27, 2023

do we have test for this?

@baorliu
Copy link
Contributor Author

baorliu commented Apr 3, 2023

do we have test for this?

Yes. The code (PR13789 and PR 13764) has passed the following tests (from the test section in this document):
Non-Crash Test cases:
Test static route config with "bfd"="true" (passed)
Test static route config update with "bfd"="true" (passed)
Test static route ("bfd"="true") removed from config (passed)
Test 2 static routes ("bfd"="true") share same nexthop (passed)
Test BFD session DOWN causing static route updated/removed (passed)
Test BFD session UP causing static route updated/reinstalled (passed)
Crash/Restart Test Cases:
Restart StaticRouteBfd between static route config and BFD state update (passed)
Restart StaticRouteBfd between static route adding/deleting (passed)

To avoid StaticRouteTimer deleting StaticRouteBfd created static route entry in appl_db, a new field "expiry" is introduced in appl_db STATIC_ROUTE_TABLE schema. StaticRouteBfd sets "expiry"="false" when it writes a static route to appl_db STATIC_ROUTE_TABLE. When StaticRouteTimer see "expiry"="false", it ignores this static route entry. <br>


<img src="static_rt_bfd_overview.png" width="500">
Copy link
Collaborator

@venkatmahalingam venkatmahalingam Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I assume red color coded are newly introduced modules/interactions? if no, please color code accordingly. There are some typos (e.g bd, bfd stete 1). in the diagram, please fix that as well.

Copy link
Collaborator

@venkatmahalingam venkatmahalingam Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image
I think, only persistent static routes are pushed to config-db and this route will be configured to FRR and handled back as regular route via fpmsyncd for HW programming. staticroutebfd handles only the persistent static route case? how about non-persistent static route? we should handle static route entries from both config-db (persistent) and app-db (non-persistent) in staticroutebfd module.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to add the details for SAI BFD offload APIs being used for this feature.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the diagram: bfd and expiry are newly added fields. the interactions are color coded (see color code at top-right corner). fixed typos. Thanks.

Copy link
Contributor Author

@baorliu baorliu Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The target of this design is to handle static routes entry from config_db only.
If there is requirement for non-persistent static route, it could be the feature enhancement in the future.
non-persistent static route case is a more complicated scenario (it requires the client be aware of new "bfd" field, etc ). Because that StaticRouteBfd also need to write static route to AppDB to let StaticRouteMgr (appl_db) to install the route to FRR, so it may need to use different key for BFD enabled route, otherwise the static route entry will be overwritten if use same key to write to same table.
so this design support the static route in config_db only.

1. find the interface name for the nexthop
2. lookup the config_db, find the IP address of that interface
3. choose the IP address which has same type of nexthop (IPv4 or IPv6)
4. If no IP address found, use local loopback IP address, and log a warning message for that. <br /><br />
Copy link
Collaborator

@venkatmahalingam venkatmahalingam Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if there are more than one local loopback address? suggest to log a warning message when local IP addresses are not found, we should be explicit if possible instead of deciding the source IP internally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently it looks up 2 loopback ip addresses:
LOOPBACK0_NAME = "Loopback0"
LOOPBACK4096_NAME = "Loopback4096"
(see the code PR, the link is in this PR description)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm okay as long as this is assumed by bgpcfgd and works for all cases.

4. If no IP address found, use local loopback IP address, and log a warning message for that. <br /><br />

In the example (assume the "bfd" field is "true"), for the nexthop "20.0.10.3", the corresponding interface name is PortChannel10.
From the interface configuration, we can found two IP addresses, an IPv4 address 20.0.10.1 and an IPv6 address 2603:10E2:400:10::1. we choose 20.0.10.1 because the nexthop IP address is IPv4 address. <br /><br />
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, when there are more than one IPv4 or IPv6 addresses, we should choose based on subnet match.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note at the beginning of this document to limit the number of IP address (one IPv4 IP address or one IPv6 IP address) for static route BFD case. The reason is that, BFD need to work in pair, two BFD sessions in local system and remote system need to talk to each other to get BFD sessions UP, remote system's peer IP address need to be same as local system's local_addr (the IP assigned to the interface in this static route BFD case). If there are more than one IPv4 (or IPv6) addresses, it will be confusing when config the remote system, and the BFD session could not be UP if the IP address mismatches.


Two optional fields are introduced to STATIC_ROUTE_TABLE: <br>
1. "bfd" -- Default value is false when this field is not configured
2. "expiry" -- Default value is true when this field is not configured
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why we need this "expiry" field, as we expect to have only non-persistent STATIC_ROUTE_TABLE in the APPL_DB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is being used only for non-persistent route but in this HLD we focus only persistent route (CONFIG_DB) which means there should not be any STATIC_ROUTE_TABLE in the APPL_DB or in other words, we should not worry about STATIC_ROUTE_TABLE presence in the APPL_DB, do we use STATIC_ROUTE_TABLE present in APPL_DB for something?

I thought, we could send only the BFD_SESSION information all the way to SAI via APPL_DB & ASIC_DB and get back the notification if the session is down/up for us to react in the StaticRouteBfd module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The process is that, StaticRouteBfd read static route from config_db (persistent), and write the processed route to appl_db STATIC_ROUTE_TABLE, to utilize StaticRouteMgr(appl_db) to install the route to FRR. so the processed route looks exact same as non-persistent route (to StaticRouteMgr(appl_db)). that is the reason we set expiry=false to avoid StaticRouteTimer deleting it.

for your last comments, the bfdorch is already in the system to provide bfd service, write request to appl_db BFD_SESSION_TABLE can utilize the current bfdorch service. bfdorch maintains local_discriminator and BFD sessions in sonic system, other service should not create BFD session by directly calling SAI API.

1. TABLE_CONFIG: config_db STATIC_ROUTE_TABLE cache (for the route with "bfd"="true" only)
2. TABLE_NEXTHOP: different prefixes may have same nexthop. This table is used to track which prefix is using the nexthop.
3. TABLE_BFD: bfd session created by StaticRouteBfd. The contents are part of appl_db BFD_SESSION_TABLE (for the session its peer IP is in nexthop table). *Assumption: BFD session is not shared with other applications*
4. TABLE_SR: the static routes written to appl_db STATIC_ROUTE_TABLE by BfdRouteManager (with "expiry"="false"). It's nexthop list might be different from the configuration depends on BFD session state.<br>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to STATIC_ROUTE instead of SR as it conflicts with segment routing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed the table name to TABLE_SRT (avoiding SR, but want to keep the name short).


### Start event processing
Start event processing after the above table and redis DB reconciliation.<br>
In the rare case that, data were read from DB but also get an event later, cause rewrite BFD session or Static Route to appl_DB with same contents should not be a problem.
Copy link
Collaborator

@venkatmahalingam venkatmahalingam Apr 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add sections for the new tables schema?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it better to create a PR to update the schema on the original static route document?
https://github.com/sonic-net/SONiC/blob/master/doc/static-route/SONiC_static_route_hdl.md#3211-static_route
There is a link to the above document.
if the schema in two different documents, need to update both documents to keep them in sync.

baorliu added 16 commits May 14, 2023 20:05
Follow the StaticRouteTimer style, rename the component to StaticRouteBfd.
StaticRouteBfd does NOT use bgpcfgd/manager framework
change the "refreshable" field name to "expiry", because StaticRouteTimer already implemented the required feature with field name "expiry"
change table name TABLE_SR to TABLE_SRT  (static route). SR seems like "segment routing" related.
Added a new section to describe the design to support the bfd flag changing at runtime.
Change the logic when bfd flag changes from false to true, to not interrupt the traffic.
update bfd field dynamic change design.
the redis events are not reliable from T2 end-to-end test result.
when there is consecutive write to a specific key, event there is 0.1 second delay, the consumer side may miss the redis "set" event.
Here is the log:
May  4 22:47:22.341016 sfd-lt2-lc2 NOTICE bgp0#bgpcfgd: :- pops: Miss table key STATIC_ROUTE:default:3.3.3.3/32, possibly outdated
don't need to check vnet prefix, remove that step
@zhangyanzhao zhangyanzhao force-pushed the static_rt_bfd_baorliu branch from d575bd3 to f527967 Compare May 15, 2023 03:05
abdosi
abdosi previously approved these changes May 15, 2023
@abdosi
Copy link
Contributor

abdosi commented May 15, 2023

@prsunny can you please help with review/sign-off this

prsunny
prsunny previously approved these changes May 16, 2023
Copy link
Contributor

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please fix typos

<br>
<br>

## Dynamc static route "bfd" config chagne support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change - typo

<br>

## Dynamc static route "bfd" config chagne support
When a static route "bfd" field changes (from "true" to "false", or from "false" to "true"), the owner(StaticRouteMgr or StaticRouteBfd) of this static routes also changes. StaticRouteMgr and StaticRouteBfd need to work together to handle this runtime ownershop change.<br>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ownership - typo


## Dynamc static route "bfd" config chagne support
When a static route "bfd" field changes (from "true" to "false", or from "false" to "true"), the owner(StaticRouteMgr or StaticRouteBfd) of this static routes also changes. StaticRouteMgr and StaticRouteBfd need to work together to handle this runtime ownershop change.<br>
When StaticRouteMgr gets a static route update (redis SET event) event, it checks the static route "bfd" field and its local cache. If the "bfd" field is "true" and the prefix is in its local cache(it handles this route before the update), the StaticRouteMgr delete it from its local cache and does not do any furhter processing for this route.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

further - typo

@baorliu baorliu dismissed stale reviews from prsunny and abdosi via fab5b55 May 16, 2023 20:36
@abdosi abdosi merged commit 395d525 into sonic-net:master May 18, 2023
prsunny pushed a commit to sonic-net/sonic-buildimage that referenced this pull request May 26, 2023
* [BFD] staticroutebfd implementation
* To enable the BFD for static route

HLD: sonic-net/SONiC#1216
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Archived in project
Development

Successfully merging this pull request may close these issues.

8 participants