Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bmp various fixes #17373

Closed
wants to merge 3 commits into from
Closed

Conversation

pguibert6WIND
Copy link
Member

A commit to re-send post-policy BGP updates when nexthop reachability is changed.

A commit related to an ASAN issue encountered with rx_open and tx_open buffers.

bgpd/bgp_route.c Outdated
@@ -4182,6 +4182,12 @@ static void bgp_process_internal(struct bgp *bgp, struct bgp_dest *dest,
return;
}

void bgp_process_hook(struct bgp *bgp, struct peer *peer, struct bgp_dest *dest, afi_t afi,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll bite: why?
why add a layer that just rearranges the arguments, and hides the "withdraw" boolean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • the layer, it is because it is not possible to call a HOOK from outside of bgp_route.c
  • the withdraw, I can maintain it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but ... isn't that what hooks are for - they're hooks from our code out to other callbacks? what would it mean to have the callbacks for an existing hook are called from ... some other places?
this looks like a "leak" event - maybe it should have its own hook (if you think there's a use-case for that)
and yes, since there are both "update" and "withdraw" leak events, you should ... have a way to express that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, even if parameters are similar, this is not exactly the correct hook I wanted to call.
BMP is interested in having nexthop tracking hooks, in order to consider a path as reachable or not, and then further update the post-rib.

@pguibert6WIND pguibert6WIND force-pushed the bmp_fix branch 2 times, most recently from 58dbc35 to efbc62f Compare November 7, 2024 21:26
@github-actions github-actions bot added size/M and removed size/S labels Nov 7, 2024
bgpd/bgp_bmp.c Outdated Show resolved Hide resolved
@ton31337
Copy link
Member

ton31337 commented Nov 8, 2024

@pguibert6WIND
Copy link
Member Author

Can we fix also https://ci1.netdef.org/browse/FRR-PULLREQ3-CHECKOUT-6046?

I satisfied it, but frrbot will not be happy

@pguibert6WIND pguibert6WIND force-pushed the bmp_fix branch 3 times, most recently from 6ed7ba5 to ec11074 Compare November 18, 2024 14:43
@ton31337
Copy link
Member

@Mergifyio backport stable/10.2 stable/10.1 stable/10.0 stable/9.1

Copy link

mergify bot commented Nov 20, 2024

backport stable/10.2 stable/10.1 stable/10.0 stable/9.1

🟠 Waiting for conditions to match

  • merged [📌 backport requirement]

Copy link
Contributor

@mxyns mxyns left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've already encountered this issue before, thanks for taking care of it with this PR :)

prefix2str(&dest->rn->p, pfxprint, sizeof(pfxprint));
frrtrace(4, frr_bgp, bmp_nht_path_valid, bgp, pfxprint, path, valid);
}
if (bgp->peer_self == path->peer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't "redistributed networks relevant for bmp" ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

redistributed networks are the ones that are added in BGP when doing:

router bgp 65500
address-family ipv4 unicast
redistribute static

because they originate from already selected routes, those routes are always available, and are not impacted by BGP nexthop tracking. If there is a reachability problem, BGP will receive a bgp_redistribute_delete command from the route that disappeared from static route list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohh okay i see, you meant "not relevant for next hop tracking" not "for bmp"
i understand it now thx

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to clarify, when nht does not work, here is the output on a BMP collector:

{"peer_type": "route distinguisher instance", "policy": "pre-policy", "ipv6": true, "peer_ip": "192:167::3", "peer_distinguisher": "555:1", "peer_asn": 65501, "peer_bgp_id": "192.168.1.3", "timestamp": "2024-11-28 17:05:45.111963", "bmp_log_type": "update", "origin": "IGP", "as_path": "", "afi": 2, "safi": 1, "nxhp_ip": "192:167::3", "nxhp_link-local": "fe80::7caf:c7ff:fe63:2b4d", "ip_prefix": "2001::1125/128", "seq": 32}
{"peer_type": "route distinguisher instance", "policy": "post-policy", "ipv6": true, "peer_ip": "192:167::3", "peer_distinguisher": "555:1", "peer_asn": 65501, "peer_bgp_id": "192.168.1.3", "timestamp": "2024-11-28 17:05:45.111963", "bmp_log_type": "update", "origin": "IGP", "as_path": "", "afi": 2, "safi": 1, "nxhp_ip": "192:167::3", "nxhp_link-local": "fe80::7caf:c7ff:fe63:2b4d", "ip_prefix": "2001::1125/128", "seq": 33}
{"peer_type": "route distinguisher instance", "policy": "pre-policy", "ipv6": true, "peer_ip": "192:167::3", "peer_distinguisher": "555:1", "peer_asn": 65501, "peer_bgp_id": "192.168.1.3", "timestamp": "2024-11-28 17:05:45.111962", "bmp_log_type": "update", "origin": "IGP", "as_path": "", "afi": 2, "safi": 1, "nxhp_ip": "192:167::3", "nxhp_link-local": "fe80::7caf:c7ff:fe63:2b4d", "ip_prefix": "2001::1125/128", "seq": 34}
{"peer_type": "loc-rib instance", "is_filtered": false, "policy": "loc-rib", "peer_distinguisher": "555:1", "peer_asn": 65501, "peer_bgp_id": "192.168.1.1", "timestamp": "2024-11-28 17:05:45.111963", "bmp_log_type": "update", "origin": "IGP", "as_path": "", "afi": 2, "safi": 1, "nxhp_ip": "192:167::3", "nxhp_link-local": "fe80::7caf:c7ff:fe63:2b4d", "ip_prefix": "2001::1125/128", "seq": 35}

As you can see, the pre-policy message for 2001::1125/128 is sent twice.
But I think it is not a big deal, because all the information in the entry tells if it is an addpath entry or not, or if it is an ECMP entry. so we can not make mistakes when reading both messages.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see, in #14847 the adj-rib-in post-policy is moved to the queue of loc-rib since they run lookups on the same structure (the bgp->rib)
because multiple monitoring modes share queues, i added a flag field to the bqe that tells which rib must send a message. this should fix this double message issue too :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and i agree, it's not ideal but it's not a huge deal either. it will not confuse a collector

@pguibert6WIND
Copy link
Member Author

@mjstapp , @eqvinox , please review this pull request.

@pguibert6WIND pguibert6WIND force-pushed the bmp_fix branch 2 times, most recently from 0e91188 to 0a37e19 Compare December 10, 2024 17:55
@pguibert6WIND
Copy link
Member Author

ci:rerun unrelated ldp test failed

The following ASAN error can be seen.

> ERROR: AddressSanitizer: attempting to call malloc_usable_size() for pointer which is not owned: 0x608000036c20
>     #0 0x7f3d7a4b5425 in __interceptor_malloc_usable_size ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:198
>     FRRouting#1 0x7f3d7a426a16 in __sanitizer::BufferedStackTrace::Unwind(unsigned long, unsigned long, void*, bool, unsigned int) ../../../../src/libsanitizer/sanitizer_common
> /sanitizer_stacktrace.h:122
>     FRRouting#2 0x7f3d7a426a16 in __asan::asan_malloc_usable_size(void const*, unsigned long, unsigned long) ../../../../src/libsanitizer/asan/asan_allocator.cpp:1074
>     FRRouting#3 0x7f3d7a03f330 in mt_count_free lib/memory.c:78
>     FRRouting#4 0x7f3d7a03f330 in qfree lib/memory.c:130
>     FRRouting#5 0x7f3d76ccf89b in bmp_peer_status_changed bgpd/bgp_bmp.c:982
>     FRRouting#6 0x560ae2aa6a94 in hook_call_peer_status_changed bgpd/bgp_fsm.c:47
>     FRRouting#7 0x560ae2aa6a94 in bgp_fsm_change_status bgpd/bgp_fsm.c:1287
>     FRRouting#8 0x560ae2c4f2e5 in peer_delete bgpd/bgpd.c:2777
>     FRRouting#9 0x560ae2c58d24 in bgp_delete bgpd/bgpd.c:4140
>     FRRouting#10 0x560ae2bbb47e in no_router_bgp bgpd/bgp_vty.c:1764
>     FRRouting#11 0x7f3d79fb74ed in cmd_execute_command_real lib/command.c:1003
>     FRRouting#12 0x7f3d79fb78a3 in cmd_execute_command lib/command.c:1062
>     FRRouting#13 0x7f3d79fb7e03 in cmd_execute lib/command.c:1228
>     FRRouting#14 0x7f3d7a107b53 in vty_command lib/vty.c:625
>     FRRouting#15 0x7f3d7a109902 in vty_execute lib/vty.c:1388
>     FRRouting#16 0x7f3d7a10cc32 in vtysh_read lib/vty.c:2400
>     FRRouting#17 0x7f3d7a0f848b in event_call lib/event.c:2019
>     FRRouting#18 0x7f3d7a01e627 in frr_run lib/libfrr.c:1232
>     FRRouting#19 0x560ae29e0037 in main bgpd/bgp_main.c:555
>     FRRouting#20 0x7f3d79a29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>     FRRouting#21 0x7f3d79a29e3f in __libc_start_main_impl ../csu/libc-start.c:392
>     FRRouting#22 0x560ae29e4ef4 in _start (/usr/lib/frr/bgpd+0x2eeef4)
>
> 0x608000036c20 is located 0 bytes inside of 81-byte region [0x608000036c20,0x608000036c71)
> freed by thread T0 here:
>     #0 0x7f3d7a4b4537 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:127
>     FRRouting#1 0x7f3d76ccf85f in bmp_peer_status_changed bgpd/bgp_bmp.c:981
>     FRRouting#2 0x560ae2aa6a94 in hook_call_peer_status_changed bgpd/bgp_fsm.c:47
>     FRRouting#3 0x560ae2aa6a94 in bgp_fsm_change_status bgpd/bgp_fsm.c:1287
>     FRRouting#4 0x560ae2c4f2e5 in peer_delete bgpd/bgpd.c:2777
>     FRRouting#5 0x560ae2c58d24 in bgp_delete bgpd/bgpd.c:4140
>     FRRouting#6 0x560ae2bbb47e in no_router_bgp bgpd/bgp_vty.c:1764
>     FRRouting#7 0x7f3d79fb74ed in cmd_execute_command_real lib/command.c:1003
>     FRRouting#8 0x7f3d79fb78a3 in cmd_execute_command lib/command.c:1062
>     FRRouting#9 0x7f3d79fb7e03 in cmd_execute lib/command.c:1228
>     FRRouting#10 0x7f3d7a107b53 in vty_command lib/vty.c:625
>     FRRouting#11 0x7f3d7a109902 in vty_execute lib/vty.c:1388
>     FRRouting#12 0x7f3d7a10cc32 in vtysh_read lib/vty.c:2400
>     FRRouting#13 0x7f3d7a0f848b in event_call lib/event.c:2019
>     FRRouting#14 0x7f3d7a01e627 in frr_run lib/libfrr.c:1232
>     FRRouting#15 0x560ae29e0037 in main bgpd/bgp_main.c:555
>     FRRouting#16 0x7f3d79a29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
>
> previously allocated by thread T0 here:
>     #0 0x7f3d7a4b4887 in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
>     FRRouting#1 0x7f3d7a03f0e9 in qmalloc lib/memory.c:101
>     FRRouting#2 0x7f3d76cd0166 in bmp_bgp_peer_vrf bgpd/bgp_bmp.c:2194
>     FRRouting#3 0x7f3d76cd0166 in bmp_bgp_update_vrf_status bgpd/bgp_bmp.c:2236
>     FRRouting#4 0x7f3d76cd29b8 in bmp_vrf_state_changed bgpd/bgp_bmp.c:3479
>     FRRouting#5 0x560ae2c45b34 in hook_call_bgp_instance_state bgpd/bgpd.c:88
>     FRRouting#6 0x560ae2c4d158 in bgp_instance_up bgpd/bgpd.c:3936
>     FRRouting#7 0x560ae29e5ed1 in bgp_vrf_enable bgpd/bgp_main.c:299
>     FRRouting#8 0x7f3d7a0ff8b1 in vrf_enable lib/vrf.c:286
>     FRRouting#9 0x7f3d7a0ff8b1 in vrf_enable lib/vrf.c:275
>     FRRouting#10 0x7f3d7a12ab66 in zclient_vrf_add lib/zclient.c:2561
>     FRRouting#11 0x7f3d7a12eb43 in zclient_read lib/zclient.c:4624
>     FRRouting#12 0x7f3d7a0f848b in event_call lib/event.c:2019
>     FRRouting#13 0x7f3d7a01e627 in frr_run lib/libfrr.c:1232
>     FRRouting#14 0x560ae29e0037 in main bgpd/bgp_main.c:555
>     FRRouting#15 0x7f3d79a29d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
The following warning can be seen:
> In file included from ./bgpd/bgp_trace.h:21,
>                  from bgpd/bgp_io.c:27:
> bgpd/bgp_io.c: In function ‘read_ibuf_work’:
> bgpd/bgp_io.c:202:53: warning: passing argument 1 of ‘lttng_ust_tracepoint_cb_frr_bgp___packet_read’ from incompatible pointer type [-Wincompatible-pointer-types]
>   202 |         frrtrace(2, frr_bgp, packet_read, connection->peer, pkt);
>       |                                           ~~~~~~~~~~^~~~~~
>       |                                                     |
>       |                                                     struct peer *
> bgpd/bgp_io.c:202:9: note: in expansion of macro ‘frrtrace’
>   202 |         frrtrace(2, frr_bgp, packet_read, connection->peer, pkt);
>       |         ^~~~~~~~
> In file included from ./bgpd/bgp_trace.h:21,
>                  from bgpd/bgp_io.c:27:
> ./bgpd/bgp_trace.h:57:43: note: expected ‘struct peer_connection *’ but argument is of type ‘struct peer *’
>    57 |         TP_ARGS(struct peer_connection *, connection, struct stream *, pkt),
>       |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~

Use the appropriate connection parameter when calling the trace.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
When a BGP listener configured with BMP receives the first BGP
IPv6 update from a connected BGP IPv6 peer, the BMP collector
receives a withdraw post-policy message.

> {"peer_type": "route distinguisher instance", "policy": "post-policy",
> "ipv6": true, "peer_ip": "192:167::3", "peer_distinguisher": "444:1",
> "peer_asn": 65501, "peer_bgp_id": "192.168.1.3", "timestamp":
> "2024-10-29 11:44:47.111962", "bmp_log_type": "withdraw", "afi": 2,
> "safi": 1, "ip_prefix": "2001::1125/128", "seq": 22}
> {"peer_type": "route distinguisher instance", "policy": "pre-policy",
> "ipv6": true, "peer_ip": "192:167::3", "peer_distinguisher": "444:1",
> "peer_asn": 65501, "peer_bgp_id": "192.168.1.3", "timestamp":
> "2024-10-29 11:44:47.111963", "bmp_log_type": "update", "origin":
> "IGP", "as_path": "", "afi": 2, "safi": 1, "nxhp_ip": "192:167::3",
> "nxhp_link-local": "fe80::7063:d8ff:fedb:9e11", "ip_prefix": "2001::1125/128", "seq": 23}

Actually, the BGP update is not valid, and BMP considers it as a
withdraw message. The BGP upate is not valid, because the nexthop
reachability is unknown at the time of reception, and no other
BMP message is sent.

Fix this by re-sending a BMP post update message when nexthop
tracking becomes successfull. Generalise the re-sending of
messages when nexthop tracking changes.

Signed-off-by: Philippe Guibert <philippe.guibert@6wind.com>
@pguibert6WIND
Copy link
Member Author

replaced/enlarged by #17733 .

@ton31337
Copy link
Member

why replaced?

@pguibert6WIND
Copy link
Member Author

why replaced?

it is enlarged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants