Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zebra fpm backpressure #16220

Merged
merged 4 commits into from
Sep 6, 2024
Merged

Conversation

donaldsharp
Copy link
Member

See individual commits

@mjstapp mjstapp self-requested a review June 13, 2024 16:55
@donaldsharp donaldsharp force-pushed the zebra_fpm_backpressure branch 3 times, most recently from 5aabbcd to 69981f9 Compare June 14, 2024 15:51
", holding off work",
__func__, prov->dp_name, curr);
counter = 0;
} else if (out_curr >= (uint64_t)limit) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be a situation where the input and output queue is hitting the limit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah but it doesn't matter from my perspective. Both cases we do the same thing. It's sufficient to just state the first one encountered in a debug and move on.

}
}


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop this change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

err. I intentionally added a newline to space things out a bit. So I am a bit confused

* Let's limit the work to how what can be put on the
* in or out queue without going over
*/
limit -= (curr > out_curr) ? curr : out_curr;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: limit -= max(curr, out_curr);?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

* in or out queue without going over
*/
limit -= (curr > out_curr) ? curr : out_curr;
if ((uint32_t)limit != zdplane_info.dg_updates_per_cycle)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dg_updates_per_cycle is uint32, do we need a cast here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

limit is not a uint32_t and as such it will complain about the size comparisons

if (curr >= (uint64_t)limit) {
reschedule = true;
if (IS_ZEBRA_DEBUG_DPLANE_DETAIL)
zlog_debug("%s: Next Provider(%s) Input queue is %" PRIu64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, printing next_prov->dp_name would be useful here, otherwise hard to distinguish?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are printing it?

@donaldsharp donaldsharp force-pushed the zebra_fpm_backpressure branch 3 times, most recently from 3261d90 to 490cfa2 Compare June 18, 2024 20:08
zebra/dplane_fpm_nl.c Show resolved Hide resolved
zebra/zebra_dplane.c Show resolved Hide resolved
@pbrisset pbrisset self-requested a review July 9, 2024 15:33
lguohan pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Aug 6, 2024
…rns (#19717)

Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns

How I did it
New patches that were added:

Patch	FRR Pull request
0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch	FRRouting/frr#15411
0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch	FRRouting/frr#15524
0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch	FRRouting/frr#15326
0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch	FRRouting/frr#15524
0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch	FRRouting/frr#15524
0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch	FRRouting/frr#15624
0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch	FRRouting/frr#15728
0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch	FRRouting/frr#15727
0038-zebra-Actually-display-I-O-buffer-sizes.patch	FRRouting/frr#15708
0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch	FRRouting/frr#15769
0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch	FRRouting/frr#16034
0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch	FRRouting/frr#16035
0042-zebra-Use-built-in-data-structure-counter.patch	FRRouting/frr#16221
0043-zebra-Use-the-ctx-queue-counters.patch	FRRouting/frr#16220
0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch	FRRouting/frr#16220
0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch	FRRouting/frr#16220
0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch	FRRouting/frr#16220
0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch	FRRouting/frr#16234
0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch	FRRouting/frr#16368
0049-bgpd-backpressure-Improve-debuggability.patch	FRRouting/frr#16368
0050-bgpd-backpressure-Avoid-use-after-free.patch	FRRouting/frr#16437
0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch	FRRouting/frr#16416
0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch	FRRouting/frr#16416
matiAlfaro pushed a commit to Marvell-switching/sonic-buildimage that referenced this pull request Aug 6, 2024
…rns (sonic-net#19717)

Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns

How I did it
New patches that were added:

Patch	FRR Pull request
0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch	FRRouting/frr#15411
0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch	FRRouting/frr#15524
0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch	FRRouting/frr#15326
0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch	FRRouting/frr#15524
0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch	FRRouting/frr#15524
0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch	FRRouting/frr#15624
0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch	FRRouting/frr#15728
0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch	FRRouting/frr#15727
0038-zebra-Actually-display-I-O-buffer-sizes.patch	FRRouting/frr#15708
0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch	FRRouting/frr#15769
0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch	FRRouting/frr#16034
0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch	FRRouting/frr#16035
0042-zebra-Use-built-in-data-structure-counter.patch	FRRouting/frr#16221
0043-zebra-Use-the-ctx-queue-counters.patch	FRRouting/frr#16220
0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch	FRRouting/frr#16220
0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch	FRRouting/frr#16220
0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch	FRRouting/frr#16220
0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch	FRRouting/frr#16234
0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch	FRRouting/frr#16368
0049-bgpd-backpressure-Improve-debuggability.patch	FRRouting/frr#16368
0050-bgpd-backpressure-Avoid-use-after-free.patch	FRRouting/frr#16437
0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch	FRRouting/frr#16416
0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch	FRRouting/frr#16416
@donaldsharp donaldsharp force-pushed the zebra_fpm_backpressure branch from 490cfa2 to 7bb5cb7 Compare August 16, 2024 12:27
@frrbot frrbot bot added the zebra label Aug 16, 2024
@donaldsharp donaldsharp force-pushed the zebra_fpm_backpressure branch from 7bb5cb7 to 5eda763 Compare August 30, 2024 12:42
The ctx queue data structures already have a counter
associated with them.  Let's just use them instead.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently when the dplane_thread_loop is run, it moves contexts
from the dg_update_list and puts the contexts on the input queue
of the first provider.  This provider is given a chance to run
and then the items on the output queue are pulled off and placed
on the input queue of the next provider.  Rinse/Repeat down through
the entire list of providers.  Now imagine that we have a list
of multiple providers and the last provider is getting backed up.
Contexts will end up sticking in the input Queue of the `slow`
provider.  This can grow without bounds.  This is a real problem
when you have a situation where an interface is flapping and an
upper level protocol is sending a continous stream of route
updates to reflect the change in ecmp.  You can end up with
a very very large backlog of contexts.  This is bad because
zebra can easily grow to a very very large memory size and on
restricted systems you can run out of memory.  Fortunately
for us, the MetaQ already participates with this process
by not doing more route processing until the dg_update_list
goes below the working limit of dg_updates_per_cycle.  Thus
if FRR modifies the behavior of this loop to not move more
contexts onto the input queue if either the input queue
or output queue of the next provider has reached this limit.
FRR will naturaly start auto handling backpressure for the dplane
context system and memory will not go out of control.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dplane providers have a concept of input queues
and output queues.  These queues are chained together
during normal operation.  The code in zebra also has
a feedback mechanism where the MetaQ will not run when
the first input queue is backed up.  Having the dplane_fpm_nl
code grab all contexts when it is backed up prevents
this system from behaving appropriately.

Modify the code to not add to the dplane_fpm_nl's internal
queue when it is already full.  This will allow the backpressure
to work appropriately in zebra proper.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The show zebra dplane provider command was ommitting
the input and output queues to the dplane itself.
It would be nice to have this insight as well.

New output:
r1# show zebra dplane providers
dataplane Incoming Queue from Zebra: 100
Zebra dataplane providers:
  Kernel (1): in: 6, q: 0, q_max: 3, out: 6, q: 14, q_max: 3
  dplane_fpm_nl (2): in: 6, q: 10, q_max: 3, out: 6, q: 0, q_max: 3
dataplane Outgoing Queue to Zebra: 43
r1#

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
@donaldsharp donaldsharp force-pushed the zebra_fpm_backpressure branch from 5eda763 to 98b11de Compare September 5, 2024 19:52
Copy link
Contributor

@mjstapp mjstapp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me now, thanks

@ton31337 ton31337 merged commit ebfcbbc into FRRouting:master Sep 6, 2024
11 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request Oct 22, 2024
…rns (sonic-net#19717)

Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns

How I did it
New patches that were added:

Patch	FRR Pull request
0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch	FRRouting/frr#15411
0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch	FRRouting/frr#15524
0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch	FRRouting/frr#15326
0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch	FRRouting/frr#15524
0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch	FRRouting/frr#15524
0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch	FRRouting/frr#15624
0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch	FRRouting/frr#15728
0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch	FRRouting/frr#15727
0038-zebra-Actually-display-I-O-buffer-sizes.patch	FRRouting/frr#15708
0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch	FRRouting/frr#15769
0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch	FRRouting/frr#16034
0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch	FRRouting/frr#16035
0042-zebra-Use-built-in-data-structure-counter.patch	FRRouting/frr#16221
0043-zebra-Use-the-ctx-queue-counters.patch	FRRouting/frr#16220
0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch	FRRouting/frr#16220
0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch	FRRouting/frr#16220
0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch	FRRouting/frr#16220
0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch	FRRouting/frr#16234
0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch	FRRouting/frr#16368
0049-bgpd-backpressure-Improve-debuggability.patch	FRRouting/frr#16368
0050-bgpd-backpressure-Avoid-use-after-free.patch	FRRouting/frr#16437
0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch	FRRouting/frr#16416
0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch	FRRouting/frr#16416
mssonicbld pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Oct 23, 2024
…rns (#19717)

Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns

How I did it
New patches that were added:

Patch	FRR Pull request
0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch	FRRouting/frr#15411
0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch	FRRouting/frr#15524
0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch	FRRouting/frr#15326
0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch	FRRouting/frr#15524
0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch	FRRouting/frr#15524
0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch	FRRouting/frr#15624
0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch	FRRouting/frr#15728
0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch	FRRouting/frr#15727
0038-zebra-Actually-display-I-O-buffer-sizes.patch	FRRouting/frr#15708
0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch	FRRouting/frr#15769
0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch	FRRouting/frr#16034
0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch	FRRouting/frr#16035
0042-zebra-Use-built-in-data-structure-counter.patch	FRRouting/frr#16221
0043-zebra-Use-the-ctx-queue-counters.patch	FRRouting/frr#16220
0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch	FRRouting/frr#16220
0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch	FRRouting/frr#16220
0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch	FRRouting/frr#16220
0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch	FRRouting/frr#16234
0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch	FRRouting/frr#16368
0049-bgpd-backpressure-Improve-debuggability.patch	FRRouting/frr#16368
0050-bgpd-backpressure-Avoid-use-after-free.patch	FRRouting/frr#16437
0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch	FRRouting/frr#16416
0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch	FRRouting/frr#16416
cscarpitta added a commit to cscarpitta/sonic-buildimage that referenced this pull request Dec 12, 2024
Apply the following patch to dplane_fpm_sonic:
* Zebra fpm backpressure (FRRouting/frr#16220)

Signed-off-by: cscarpitta <cscarpit@cisco.com>
cscarpitta added a commit to cscarpitta/sonic-buildimage that referenced this pull request Dec 12, 2024
Apply the following patch to dplane_fpm_sonic:
* Zebra fpm backpressure (FRRouting/frr#16220)

Signed-off-by: cscarpitta <cscarpit@cisco.com>
cscarpitta added a commit to cscarpitta/sonic-buildimage that referenced this pull request Dec 12, 2024
Apply the following patch to dplane_fpm_sonic:
* Zebra fpm backpressure (FRRouting/frr#16220)

Signed-off-by: cscarpitta <cscarpit@cisco.com>
lguohan pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Dec 16, 2024
…#21146)

Why I did it
Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously

How I did it
Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic:

zebra: Use built in data structure counter (zebra: Use built in data structure counter FRRouting/frr#16221)
Zebra fpm backpressure (Zebra fpm backpressure FRRouting/frr#16220)

Signed-off-by: cscarpitta <cscarpit@cisco.com>
mssonicbld added a commit to mssonicbld/sonic-buildimage that referenced this pull request Jan 8, 2025
<!--
     Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

     ** Make sure all your commits include a signature generated with `git commit -s` **

     If this is a bug fix, make sure your description includes "fixes #xxxx", or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it

Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously

#### How I did it

Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic:
* zebra: Use built in data structure counter (FRRouting/frr#16221)
* Zebra fpm backpressure (FRRouting/frr#16220)

<!--
#### How to verify it

If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

<!--
#### Which release branch to backport (provide reason below if selected)

- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

-->
<!--
#### Tested branch (Please provide the tested image version)

- Please provide tested image version
- e.g.
- [x] 20201231.100

- [ ]
- [ ]
-->

<!--
#### Description for the changelog
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

<!--
#### Link to config_db schema for YANG module changes
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

<!--
#### A picture of a cute animal (not mandatory but encouraged)
-->
mssonicbld added a commit to sonic-net/sonic-buildimage that referenced this pull request Jan 9, 2025
<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it

Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously

#### How I did it

Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic:
* zebra: Use built in data structure counter (FRRouting/frr#16221)
* Zebra fpm backpressure (FRRouting/frr#16220)

<!--
#### How to verify it

If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

<!--
#### Which release branch to backport (provide reason below if selected)

- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

-->
<!--
#### Tested branch (Please provide the tested image version)

- Please provide tested image version
- e.g.
- [x] 20201231.100

- [ ]
- [ ]
-->

<!--
#### Description for the changelog
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

<!--
#### Link to config_db schema for YANG module changes
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

<!--
#### A picture of a cute animal (not mandatory but encouraged)
-->
VladimirKuk pushed a commit to Marvell-switching/sonic-buildimage that referenced this pull request Jan 21, 2025
…sonic-net#21146)

Why I did it
Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously

How I did it
Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic:

zebra: Use built in data structure counter (zebra: Use built in data structure counter FRRouting/frr#16221)
Zebra fpm backpressure (Zebra fpm backpressure FRRouting/frr#16220)

Signed-off-by: cscarpitta <cscarpit@cisco.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants