
Optimizations and problem fixing for large scale ecmp from bgp #17229

Merged
merged 5 commits into FRRouting:master on Oct 25, 2024

Conversation

donaldsharp
Member

See the individual commits.

bgp_update is a very expensive call.  Calling evpn_overlay_free
even when we have no evpn data to free is not trivial.  Let's
limit the call into this function to the cases where we actually
have data to free.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
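
A minimal sketch of the shape of this guard, using illustrative stand-in types rather than the exact FRR structures: test for attached EVPN data before paying for the call.

```c
#include <stdlib.h>

/* Illustrative stand-ins for FRR's attribute/EVPN structures. */
struct evpn_overlay {
	void *data;
};

struct attr {
	struct evpn_overlay *evpn_overlay;
};

static void evpn_overlay_free(struct attr *attr)
{
	free(attr->evpn_overlay);
	attr->evpn_overlay = NULL;
}

/* The optimization: a cheap pointer test on the hot path, so the
 * expensive free routine only runs when there is data to free. */
static void attr_flush_evpn(struct attr *attr)
{
	if (attr->evpn_overlay)
		evpn_overlay_free(attr);
}
```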
@frrbot added the bgp label Oct 24, 2024
@donaldsharp force-pushed the bgp_update_optimizations branch 2 times, most recently from fab3725 to 0ea35c6, on October 24, 2024 at 23:42
When running bestpath on a very large number of ECMP paths,
BGP ends up calling aspath_count a very large number of times,
which results in ~15% of cpu runtime spent in aspath_count_hops.
Modify the aspath to keep track of its own count.  This brings
the function down to ~1.5% of cpu runtime, small enough for the
moment to be ignored.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
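
A hedged sketch of the caching idea with simplified segment/aspath types (field and function names are illustrative, not the exact FRR ones): the count is maintained wherever segments are mutated, so bestpath reads it in O(1) instead of rewalking the segment list on every comparison.

```c
#include <stdint.h>
#include <stddef.h>

struct assegment {
	struct assegment *next;
	uint8_t length; /* number of ASNs in this segment */
};

struct aspath {
	struct assegment *segments;
	uint32_t count; /* cached hop count, updated on every mutation */
};

/* The old per-call cost: an O(segments) walk on every invocation. */
static uint32_t aspath_count_hops_walk(const struct aspath *as)
{
	uint32_t count = 0;

	for (const struct assegment *seg = as->segments; seg; seg = seg->next)
		count += seg->length;
	return count;
}

/* What bestpath now effectively pays: a single load. */
static inline uint32_t aspath_count_hops(const struct aspath *as)
{
	return as->count;
}
```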
This is just a small optimization, but when calling path_info_cmp
hundreds of millions of times it adds up.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
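
The commit text does not spell out the exact change, but a hypothetical sketch of the usual pattern for a comparator on this kind of hot path: hoist repeated attribute lookups into locals so each is computed once per call. Types and field names here are stand-ins, not bgpd's real structures.

```c
#include <stdint.h>

/* Hypothetical stand-ins for bgpd's structures. */
struct attr {
	uint32_t aspath_hops; /* e.g. the cached count from the previous commit */
};

struct bgp_path_info {
	struct attr *attr;
};

/* At hundreds of millions of comparisons, even a few saved loads
 * per call add up. */
static int path_info_cmp(const struct bgp_path_info *new_path,
			 const struct bgp_path_info *exist)
{
	uint32_t new_hops = new_path->attr->aspath_hops;
	uint32_t exist_hops = exist->attr->aspath_hops;

	if (new_hops < exist_hops)
		return 1; /* shorter AS path preferred */
	if (new_hops > exist_hops)
		return -1;

	return 0; /* remaining tiebreakers elided in this sketch */
}
```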
(gdb) bt
0  futex_wait (private=0, expected=2, futex_word=0x5c438e9a98d8) at ../sysdeps/nptl/futex-internal.h:146
1  __GI___lll_lock_wait (futex=futex@entry=0x5c438e9a98d8, private=0) at ./nptl/lowlevellock.c:49
2  0x00007af16d698002 in lll_mutex_lock_optimized (mutex=0x5c438e9a98d8) at ./nptl/pthread_mutex_lock.c:48
3  ___pthread_mutex_lock (mutex=0x5c438e9a98d8) at ./nptl/pthread_mutex_lock.c:93
4  0x00005c4369c17e70 in _frr_mtx_lock (mutex=0x5c438e9a98d8, func=0x5c4369dc2750 <__func__.265> "bgp_notify_send_internal") at ./lib/frr_pthread.h:258
5  0x00005c4369c1a07a in bgp_notify_send_internal (connection=0x5c438e9a98c0, code=8 '\b', sub_code=0 '\000', data=0x0, datalen=0, use_curr=true) at bgpd/bgp_packet.c:928
6  0x00005c4369c1a707 in bgp_notify_send (connection=0x5c438e9a98c0, code=8 '\b', sub_code=0 '\000') at bgpd/bgp_packet.c:1069
7  0x00005c4369bea422 in bgp_stop_with_notify (connection=0x5c438e9a98c0, code=8 '\b', sub_code=0 '\000') at bgpd/bgp_fsm.c:1597
8  0x00005c4369c18480 in bgp_packet_add (connection=0x5c438e9a98c0, peer=0x5c438e9b6010, s=0x7af15c06bf70) at bgpd/bgp_packet.c:151
9  0x00005c4369c19816 in bgp_keepalive_send (peer=0x5c438e9b6010) at bgpd/bgp_packet.c:639
10 0x00005c4369bf01fd in peer_process (hb=0x5c438ed05520, arg=0x7af16bdffaf0) at bgpd/bgp_keepalives.c:111
11 0x00007af16dacd8e6 in hash_iterate (hash=0x7af15c000be0, func=0x5c4369bf005e <peer_process>, arg=0x7af16bdffaf0) at lib/hash.c:252
12 0x00005c4369bf0679 in bgp_keepalives_start (arg=0x5c438e0db110) at bgpd/bgp_keepalives.c:214
13 0x00007af16dac9932 in frr_pthread_inner (arg=0x5c438e0db110) at lib/frr_pthread.c:180
14 0x00007af16d694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
15 0x00007af16d726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb)

The bgp keepalive pthread deadlocks with itself, and consequently
the bgp master pthread blocks when it attempts to lock the
peerhash_mtx, since that mutex is still held by the keepalive pthread.

The keepalive pthread locks the peerhash_mtx in bgp_keepalives_start.
Next, the connection->io_mtx mutex is locked under bgp_keepalive_send;
when a problem is then noticed, bgp_stop_with_notify is invoked,
which relocks the same mutex (and of course the relock gets stuck
on itself).  This generates a deadlock condition.

Modify the code to hold the connection->io_mtx for as short a time
as possible.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
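
A self-contained sketch of the deadlock shape and the fix, using a plain pthread mutex in place of connection->io_mtx; function names are illustrative, not the actual bgpd symbols.

```c
#include <pthread.h>

static pthread_mutex_t io_mtx = PTHREAD_MUTEX_INITIALIZER;

static void notify_send(void)
{
	pthread_mutex_lock(&io_mtx); /* second lock: blocks forever */
	/* ... queue the NOTIFICATION ... */
	pthread_mutex_unlock(&io_mtx);
}

/* Before: the error path runs with io_mtx still held, so the relock
 * inside notify_send() deadlocks against ourselves. */
static void packet_add_before(int error)
{
	pthread_mutex_lock(&io_mtx);
	/* ... enqueue packet ... */
	if (error)
		notify_send(); /* relocks io_mtx -> self-deadlock */
	pthread_mutex_unlock(&io_mtx);
}

/* After: hold the mutex only for the critical section itself and
 * drop it before entering any path that may take it again. */
static void packet_add_after(int error)
{
	pthread_mutex_lock(&io_mtx);
	/* ... enqueue packet, note whether an error was detected ... */
	pthread_mutex_unlock(&io_mtx);

	if (error)
		notify_send(); /* safe: io_mtx no longer held */
}
```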
0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:44
1  __pthread_kill_internal (signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:78
2  __GI___pthread_kill (threadid=130719886083648, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
3  0x000076e399e42476 in __GI_raise (sig=6) at ../sysdeps/posix/raise.c:26
4  0x000076e39a34f950 in core_handler (signo=6, siginfo=0x76e3985fca30, context=0x76e3985fc900) at lib/sigevent.c:258
5  <signal handler called>
6  __pthread_kill_implementation (no_tid=0, signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:44
7  __pthread_kill_internal (signo=6, threadid=130719886083648) at ./nptl/pthread_kill.c:78
8  __GI___pthread_kill (threadid=130719886083648, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
9  0x000076e399e42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
10 0x000076e399e287f3 in __GI_abort () at ./stdlib/abort.c:79
11 0x000076e39a39874b in _zlog_assert_failed (xref=0x76e39a46cca0 <_xref.27>, extra=0x0) at lib/zlog.c:789
12 0x000076e39a369dde in cancel_event_helper (m=0x5eda32df5e40, arg=0x5eda33afeed0, flags=1) at lib/event.c:1428
13 0x000076e39a369ef6 in event_cancel_event_ready (m=0x5eda32df5e40, arg=0x5eda33afeed0) at lib/event.c:1470
14 0x00005eda0a94a5b3 in bgp_stop (connection=0x5eda33afeed0) at bgpd/bgp_fsm.c:1355
15 0x00005eda0a94b4ae in bgp_stop_with_notify (connection=0x5eda33afeed0, code=8 '\b', sub_code=0 '\000') at bgpd/bgp_fsm.c:1610
16 0x00005eda0a979498 in bgp_packet_add (connection=0x5eda33afeed0, peer=0x5eda33b11800, s=0x76e3880daf90) at bgpd/bgp_packet.c:152
17 0x00005eda0a97a80f in bgp_keepalive_send (peer=0x5eda33b11800) at bgpd/bgp_packet.c:639
18 0x00005eda0a9511fd in peer_process (hb=0x5eda33c9ab80, arg=0x76e3985ffaf0) at bgpd/bgp_keepalives.c:111
19 0x000076e39a2cd8e6 in hash_iterate (hash=0x76e388000be0, func=0x5eda0a95105e <peer_process>, arg=0x76e3985ffaf0) at lib/hash.c:252
20 0x00005eda0a951679 in bgp_keepalives_start (arg=0x5eda3306af80) at bgpd/bgp_keepalives.c:214
21 0x000076e39a2c9932 in frr_pthread_inner (arg=0x5eda3306af80) at lib/frr_pthread.c:180
22 0x000076e399e94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
23 0x000076e399f26850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) f 12
12 0x000076e39a369dde in cancel_event_helper (m=0x5eda32df5e40, arg=0x5eda33afeed0, flags=1) at lib/event.c:1428
1428		assert(m->owner == pthread_self());

In this decode, the attempt to cancel the connection's events from
the wrong pthread is causing the crash.  Modify the code to create an
event on the bm->master to cancel the events for the connection.

Signed-off-by: Donald Sharp <sharpd@nvidia.com>
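
A hedged sketch of the fix's shape, assuming FRR's lib event API as it appears in the decode (event_cancel_event_ready, plus event_add_event for scheduling); the handler and wrapper names are illustrative.

```c
#include "frrevent.h" /* FRR's event loop API (lib/event.c in the decode) */

struct peer_connection; /* opaque here; defined in bgpd */

extern struct event_loop *master; /* bm->master in bgpd */

/* Runs on the master pthread, which owns the loop, so the
 * assert(m->owner == pthread_self()) in cancel_event_helper holds. */
static void bgp_connection_cancel_events(struct event *event)
{
	struct peer_connection *connection = EVENT_ARG(event);

	event_cancel_event_ready(master, connection);
}

/* Called from the keepalive pthread: instead of cancelling the
 * connection's events directly (wrong owner -> assert), hand the
 * cancellation off to the master pthread. */
static void bgp_connection_schedule_cancel(struct peer_connection *connection)
{
	event_add_event(master, bgp_connection_cancel_events, connection, 0,
			NULL);
}
```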
@donaldsharp force-pushed the bgp_update_optimizations branch from 0ea35c6 to 138935a on October 25, 2024 at 01:01
@ton31337 merged commit ba3836c into FRRouting:master Oct 25, 2024
11 checks passed
lguohan pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Dec 19, 2024
Why I did it
Added patches from FRR to support scaling BGP neighbors to 256/514. Below are the patches:

| Patch | FRR Pull request |
| ----- | ---------------- |
| 0069-lib-nexthop-code-should-use-uint16_t-for-nexthop-cou.patch | FRRouting/frr#16967 |
| 0070-Allow-16-bit-size-for-nexthops.patch | FRRouting/frr#17023 |
| 0071-zebra-Only-notify-dplane-work-pthread-when-needed.patch | FRRouting/frr#17062 |
| 0072-Fix-up-improper-handling-of-nexthops-for-nexthop-tra.patch | FRRouting/frr#17076 |
| 0073-remove-in6addr-cmp.patch | FRRouting/frr#17312 |
| 0074-bgp-best-port-reordering.patch | FRRouting/frr#15572 |
| 0075-bgp-mp-info-changes.patch | FRRouting/frr#16961 |
| 0076-Optimizations-and-problem-fixing-for-large-scale-ecmp-from-bgp.patch | FRRouting/frr#17229 |
dgsudharsan added a commit to dgsudharsan/sonic-buildimage that referenced this pull request Dec 23, 2024
github-actions bot pushed a commit to bradh352/sonic-buildimage that referenced this pull request Jan 2, 2025
mssonicbld added a commit to mssonicbld/sonic-buildimage-msft that referenced this pull request Jan 8, 2025
VladimirKuk pushed a commit to Marvell-switching/sonic-buildimage that referenced this pull request Jan 21, 2025
@donaldsharp mentioned this pull request Jan 24, 2025