-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zebra fpm backpressure #16220
Zebra fpm backpressure #16220
Conversation
5aabbcd
to
69981f9
Compare
", holding off work", | ||
__func__, prov->dp_name, curr); | ||
counter = 0; | ||
} else if (out_curr >= (uint64_t)limit) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can it be a situation where the input and output queue is hitting the limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah but it doesn't matter from my perspective. Both cases we do the same thing. It's sufficient to just state the first one encountered in a debug and move on.
} | ||
} | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we drop this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
err. I intentionally added a newline to space things out a bit. So I am a bit confused
zebra/zebra_dplane.c
Outdated
* Let's limit the work to how what can be put on the | ||
* in or out queue without going over | ||
*/ | ||
limit -= (curr > out_curr) ? curr : out_curr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: limit -= max(curr, out_curr);
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
zebra/zebra_dplane.c
Outdated
* in or out queue without going over | ||
*/ | ||
limit -= (curr > out_curr) ? curr : out_curr; | ||
if ((uint32_t)limit != zdplane_info.dg_updates_per_cycle) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dg_updates_per_cycle
is uint32
, do we need a cast here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
limit is not a uint32_t and as such it will complain about the size comparisons
if (curr >= (uint64_t)limit) { | ||
reschedule = true; | ||
if (IS_ZEBRA_DEBUG_DPLANE_DETAIL) | ||
zlog_debug("%s: Next Provider(%s) Input queue is %" PRIu64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, printing next_prov->dp_name
would be useful here, otherwise hard to distinguish?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are printing it?
3261d90
to
490cfa2
Compare
…rns (#19717) Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns How I did it New patches that were added: Patch FRR Pull request 0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch FRRouting/frr#15411 0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch FRRouting/frr#15524 0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch FRRouting/frr#15326 0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch FRRouting/frr#15524 0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch FRRouting/frr#15524 0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch FRRouting/frr#15624 0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch FRRouting/frr#15728 0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch FRRouting/frr#15727 0038-zebra-Actually-display-I-O-buffer-sizes.patch FRRouting/frr#15708 0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch FRRouting/frr#15769 0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch FRRouting/frr#16034 0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch FRRouting/frr#16035 0042-zebra-Use-built-in-data-structure-counter.patch FRRouting/frr#16221 0043-zebra-Use-the-ctx-queue-counters.patch FRRouting/frr#16220 0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch FRRouting/frr#16220 0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch FRRouting/frr#16220 0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch FRRouting/frr#16220 0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch FRRouting/frr#16234 0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch FRRouting/frr#16368 0049-bgpd-backpressure-Improve-debuggability.patch FRRouting/frr#16368 0050-bgpd-backpressure-Avoid-use-after-free.patch FRRouting/frr#16437 0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch FRRouting/frr#16416 0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch FRRouting/frr#16416
…rns (sonic-net#19717) Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns How I did it New patches that were added: Patch FRR Pull request 0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch FRRouting/frr#15411 0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch FRRouting/frr#15524 0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch FRRouting/frr#15326 0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch FRRouting/frr#15524 0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch FRRouting/frr#15524 0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch FRRouting/frr#15624 0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch FRRouting/frr#15728 0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch FRRouting/frr#15727 0038-zebra-Actually-display-I-O-buffer-sizes.patch FRRouting/frr#15708 0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch FRRouting/frr#15769 0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch FRRouting/frr#16034 0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch FRRouting/frr#16035 0042-zebra-Use-built-in-data-structure-counter.patch FRRouting/frr#16221 0043-zebra-Use-the-ctx-queue-counters.patch FRRouting/frr#16220 0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch FRRouting/frr#16220 0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch FRRouting/frr#16220 0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch FRRouting/frr#16220 0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch FRRouting/frr#16234 0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch FRRouting/frr#16368 0049-bgpd-backpressure-Improve-debuggability.patch FRRouting/frr#16368 0050-bgpd-backpressure-Avoid-use-after-free.patch FRRouting/frr#16437 0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch FRRouting/frr#16416 0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch FRRouting/frr#16416
490cfa2
to
7bb5cb7
Compare
7bb5cb7
to
5eda763
Compare
The ctx queue data structures already have a counter associated with them. Let's just use them instead. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently when the dplane_thread_loop is run, it moves contexts from the dg_update_list and puts the contexts on the input queue of the first provider. This provider is given a chance to run and then the items on the output queue are pulled off and placed on the input queue of the next provider. Rinse/Repeat down through the entire list of providers. Now imagine that we have a list of multiple providers and the last provider is getting backed up. Contexts will end up sticking in the input Queue of the `slow` provider. This can grow without bounds. This is a real problem when you have a situation where an interface is flapping and an upper level protocol is sending a continous stream of route updates to reflect the change in ecmp. You can end up with a very very large backlog of contexts. This is bad because zebra can easily grow to a very very large memory size and on restricted systems you can run out of memory. Fortunately for us, the MetaQ already participates with this process by not doing more route processing until the dg_update_list goes below the working limit of dg_updates_per_cycle. Thus if FRR modifies the behavior of this loop to not move more contexts onto the input queue if either the input queue or output queue of the next provider has reached this limit. FRR will naturaly start auto handling backpressure for the dplane context system and memory will not go out of control. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dplane providers have a concept of input queues and output queues. These queues are chained together during normal operation. The code in zebra also has a feedback mechanism where the MetaQ will not run when the first input queue is backed up. Having the dplane_fpm_nl code grab all contexts when it is backed up prevents this system from behaving appropriately. Modify the code to not add to the dplane_fpm_nl's internal queue when it is already full. This will allow the backpressure to work appropriately in zebra proper. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The show zebra dplane provider command was ommitting the input and output queues to the dplane itself. It would be nice to have this insight as well. New output: r1# show zebra dplane providers dataplane Incoming Queue from Zebra: 100 Zebra dataplane providers: Kernel (1): in: 6, q: 0, q_max: 3, out: 6, q: 14, q_max: 3 dplane_fpm_nl (2): in: 6, q: 10, q_max: 3, out: 6, q: 0, q_max: 3 dataplane Outgoing Queue to Zebra: 43 r1# Signed-off-by: Donald Sharp <sharpd@nvidia.com>
5eda763
to
98b11de
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me now, thanks
…rns (sonic-net#19717) Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns How I did it New patches that were added: Patch FRR Pull request 0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch FRRouting/frr#15411 0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch FRRouting/frr#15524 0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch FRRouting/frr#15326 0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch FRRouting/frr#15524 0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch FRRouting/frr#15524 0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch FRRouting/frr#15624 0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch FRRouting/frr#15728 0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch FRRouting/frr#15727 0038-zebra-Actually-display-I-O-buffer-sizes.patch FRRouting/frr#15708 0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch FRRouting/frr#15769 0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch FRRouting/frr#16034 0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch FRRouting/frr#16035 0042-zebra-Use-built-in-data-structure-counter.patch FRRouting/frr#16221 0043-zebra-Use-the-ctx-queue-counters.patch FRRouting/frr#16220 0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch FRRouting/frr#16220 0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch FRRouting/frr#16220 0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch FRRouting/frr#16220 0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch FRRouting/frr#16234 0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch FRRouting/frr#16368 0049-bgpd-backpressure-Improve-debuggability.patch FRRouting/frr#16368 0050-bgpd-backpressure-Avoid-use-after-free.patch FRRouting/frr#16437 0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch FRRouting/frr#16416 0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch FRRouting/frr#16416
…rns (#19717) Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns How I did it New patches that were added: Patch FRR Pull request 0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch FRRouting/frr#15411 0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch FRRouting/frr#15524 0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch FRRouting/frr#15326 0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch FRRouting/frr#15524 0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch FRRouting/frr#15524 0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch FRRouting/frr#15624 0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch FRRouting/frr#15728 0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch FRRouting/frr#15727 0038-zebra-Actually-display-I-O-buffer-sizes.patch FRRouting/frr#15708 0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch FRRouting/frr#15769 0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch FRRouting/frr#16034 0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch FRRouting/frr#16035 0042-zebra-Use-built-in-data-structure-counter.patch FRRouting/frr#16221 0043-zebra-Use-the-ctx-queue-counters.patch FRRouting/frr#16220 0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch FRRouting/frr#16220 0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch FRRouting/frr#16220 0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch FRRouting/frr#16220 0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch FRRouting/frr#16234 0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch FRRouting/frr#16368 0049-bgpd-backpressure-Improve-debuggability.patch FRRouting/frr#16368 0050-bgpd-backpressure-Avoid-use-after-free.patch FRRouting/frr#16437 0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch FRRouting/frr#16416 0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch FRRouting/frr#16416
Apply the following patch to dplane_fpm_sonic: * Zebra fpm backpressure (FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
Apply the following patch to dplane_fpm_sonic: * Zebra fpm backpressure (FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
Apply the following patch to dplane_fpm_sonic: * Zebra fpm backpressure (FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
…#21146) Why I did it Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously How I did it Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic: zebra: Use built in data structure counter (zebra: Use built in data structure counter FRRouting/frr#16221) Zebra fpm backpressure (Zebra fpm backpressure FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
<!-- Please make sure you've read and understood our contributing guidelines: https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md ** Make sure all your commits include a signature generated with `git commit -s` ** If this is a bug fix, make sure your description includes "fixes #xxxx", or "closes #xxxx" or "resolves #xxxx" Please provide the following information: --> #### Why I did it Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously #### How I did it Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic: * zebra: Use built in data structure counter (FRRouting/frr#16221) * Zebra fpm backpressure (FRRouting/frr#16220) <!-- #### How to verify it If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012. --> <!-- #### Which release branch to backport (provide reason below if selected) - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 --> <!-- #### Tested branch (Please provide the tested image version) - Please provide tested image version - e.g. - [x] 20201231.100 - [ ] - [ ] --> <!-- #### Description for the changelog Write a short (one line) summary that describes the changes in this pull request for inclusion in the changelog: --> <!-- Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU. --> <!-- #### Link to config_db schema for YANG module changes Provide a link to config_db schema for the table for which YANG model is defined Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md --> <!-- #### A picture of a cute animal (not mandatory but encouraged) -->
<!-- Please make sure you've read and understood our contributing guidelines: https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` ** If this is a bug fix, make sure your description includes "fixes #xxxx", or "closes #xxxx" or "resolves #xxxx" Please provide the following information: --> #### Why I did it Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously #### How I did it Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic: * zebra: Use built in data structure counter (FRRouting/frr#16221) * Zebra fpm backpressure (FRRouting/frr#16220) <!-- #### How to verify it If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012. --> <!-- #### Which release branch to backport (provide reason below if selected) - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 --> <!-- #### Tested branch (Please provide the tested image version) - Please provide tested image version - e.g. - [x] 20201231.100 - [ ] - [ ] --> <!-- #### Description for the changelog Write a short (one line) summary that describes the changes in this pull request for inclusion in the changelog: --> <!-- Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU. --> <!-- #### Link to config_db schema for YANG module changes Provide a link to config_db schema for the table for which YANG model is defined Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md --> <!-- #### A picture of a cute animal (not mandatory but encouraged) -->
…sonic-net#21146) Why I did it Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously How I did it Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic: zebra: Use built in data structure counter (zebra: Use built in data structure counter FRRouting/frr#16221) Zebra fpm backpressure (Zebra fpm backpressure FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
See individual commits