[FRR 8.2.2] BGP sessions are unable to establish due to interface IP not recognized #12380
Comments
Alternative repro steps: download or build a vs image from a recent PR build, but without PR #12381 included. #12381 seems to be a workaround for this issue that we might take for the short term. Deploy a kvm t1 testbed, then run this bash script:
@hasan-brcm will help take a look. @yxieca I'd suggest backing out #12381 first to unblock the build test while we are working on the RC.
Back out? Or are you suggesting we take it first? Just FYI, #12381 is not a bulletproof workaround. It does reduce the chance of hitting the issue by at least 10x, though.
Why I did it
The BGP service has always started after interface-config. However, we recently discovered an issue where some BGP sessions are unable to establish because the BGP daemon cannot read the interface IP. This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.
How I did it
Delaying the start of BGP seems to be a workaround for this issue. One caution: this delay might impact warm reboot timing and other timing sequences. The workaround reduces the probability of hitting the issue by close to 100X, but it is not bulletproof, as testing shows. It is still preferable to have a proper FRR fix and revert this change in the future.
How to verify it
Continuously issue config reload and check the BGP session status afterwards.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
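The "How to verify it" step above (repeated config reload followed by a BGP session check) could be automated roughly as follows. This is a minimal sketch, not the script used in the PR. It assumes that `sudo config reload -y` and `vtysh -c "show bgp summary json"` are available on the device, and that the JSON groups peers per address family under a `peers` key with a `state` field. The function names and the settle time are hypothetical.

```python
import json
import subprocess
import time


def non_established_peers(summary_json):
    """Return the sorted peers whose state is not "Established".

    Assumes the layout {"<afi>": {"peers": {"<ip>": {"state": ...}}}}.
    """
    summary = json.loads(summary_json)
    down = set()
    for afi in summary.values():
        if not isinstance(afi, dict):
            continue
        for peer, info in afi.get("peers", {}).items():
            if info.get("state") != "Established":
                down.add(peer)
    return sorted(down)


def run(cmd):
    """Run a shell command on the device and return its stdout."""
    result = subprocess.run(
        cmd, shell=True, check=True, capture_output=True, text=True
    )
    return result.stdout


def reload_and_check(iterations, settle_seconds, run_cmd=run, sleep=time.sleep):
    """Reload the config repeatedly; collect (iteration, down_peers) failures."""
    failures = []
    for i in range(iterations):
        run_cmd("sudo config reload -y")
        # Hypothetical settle time; tune it to the testbed.
        sleep(settle_seconds)
        down = non_established_peers(run_cmd('vtysh -c "show bgp summary json"'))
        if down:
            failures.append((i, down))
    return failures
```

Calling `reload_and_check(100, 180)` on a testbed would return an empty list if every session came back after every reload. Because the issue is intermittent, a large iteration count is needed before a clean run means much.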
@adyeung @hasan-brcm please be advised that we've merged #12381. If you want to test a fix, it would be better to revert #12381 in your test.
We have seen similar issues internally in our lab as well, with the frr 7.2 and 7.5 releases. The problem was that if a netlink dump is triggered while updates are still happening, the received dump has missing data. The symptoms ranged from a missing ip-address (which is the case here) to an invalid vrf binding.
The issue seems to come mostly from the kernel rather than frr. But let me dig more on both the kernel and frr sides to see if there is anything that can be fixed.
…interrupted (#12412)
Why I did it
There is an outstanding FRR issue #12380. This seems to be a known issue, but without a good fix so far. The root cause is around the zebra and kernel netlink interaction. The failure was previously not noticed by zebra.
How I did it
Port the patch that makes the issue obvious.
Signed-off-by: Ying Xie ying.xie@microsoft.com
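To illustrate how a receiver can notice that a netlink dump raced with concurrent updates: the Linux kernel sets the `NLM_F_DUMP_INTR` flag in the message header when a dump was interrupted and may be inconsistent. The sketch below walks a raw receive buffer and reports whether any message carries that flag. It is an illustration under that assumption, not the patch ported in #12412, and the function name is hypothetical.

```python
import struct

# struct nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid
NLMSG_HDR = struct.Struct("=IHHII")
NLM_F_DUMP_INTR = 0x10  # dump was inconsistent due to a concurrent change
NLMSG_DONE = 3          # terminates a multipart dump


def dump_was_interrupted(buf):
    """Return True if any netlink message in buf has NLM_F_DUMP_INTR set."""
    offset = 0
    interrupted = False
    while offset + NLMSG_HDR.size <= len(buf):
        length, msg_type, flags, _seq, _pid = NLMSG_HDR.unpack_from(buf, offset)
        if length < NLMSG_HDR.size:
            break  # malformed header; stop parsing
        if flags & NLM_F_DUMP_INTR:
            interrupted = True
        if msg_type == NLMSG_DONE:
            break
        # Messages are aligned to 4-byte boundaries (NLMSG_ALIGN).
        offset += (length + 3) & ~3
    return interrupted
```

A caller that sees `True` would typically discard the partial result and repeat the dump request, rather than continue with an interface table that may be missing addresses.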
@hasan-brcm I noticed one thing while debugging another issue: on bullseye, network.service is restarted several times during bootup. I'm not sure whether that leads to the missing ip-address issue. With the old kernel, network.service is started very early during system bootup. I haven't figured out why network.service is restarted several times during bootup.
What is the motivation for this PR?
FRR issue sonic-net/sonic-buildimage#12380
Verify the bgp sessions' status
How did you do it?
1: check all bgp sessions are up
2: inject failure, shutdown fanout physical interface or neighbor port
4: do the test, reset bgp or swss or do the reboot
5: Verify all bgp sessions are up
How did you verify/test it?
Run the test case
Any platform specific information?
Supported testbed topology if it's a new test case?
T0,T1
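The test flow listed above could be structured roughly like this. It is a sketch only: `all_sessions_up`, `inject_failure`, and `disruptive_action` are hypothetical callables standing in for the testbed-specific operations, and the timeout values are assumptions. The description does not say whether the injected failure is reverted before the final check, so the sketch follows the listed steps literally.

```python
import time


def wait_until(timeout, interval, condition, sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it returns True or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def run_bgp_session_check(all_sessions_up, inject_failure, disruptive_action,
                          timeout=300, interval=10, **wait_kwargs):
    """Follow the listed steps and return True if sessions recover in time."""
    # Step 1: all BGP sessions must be up before the test starts.
    if not all_sessions_up():
        raise AssertionError("BGP sessions are not all up before the test")
    # Step 2: inject the failure (for example, shut down a fanout interface).
    inject_failure()
    # Step 4: reset bgp or swss, or reboot.
    disruptive_action()
    # Step 5: poll until all BGP sessions are up again.
    return wait_until(timeout, interval, all_sessions_up, **wait_kwargs)
```

Polling with a deadline, rather than a single check after a fixed sleep, keeps the test from failing just because sessions took slightly longer than usual to establish.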
Description
Recently, many PR tests for the master/202205 branches have been failing because some BGP sessions were down after config reload or config load_minigraph, or just after deploying minigraph.
The issue has similarity to FRRouting/frr#10404. However, the fix FRRouting/frr@2cf7651 didn't address the issue.
(Check experimental PR #12366)
Steps to reproduce the issue:
Describe the results you received:
Tests failed after deploying minigraph, or in any test case that involves config reload/load_minigraph.
Describe the results you expected:
BGP sessions always come up.
Output of show version:
SONiC Software Version: SONiC.master-12366.159811-344393d99
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: 344393d99
Build date: Wed Oct 12 19:16:01 UTC 2022
Built by: AzDevOps@sonic-build-workers-00284R
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):
sonic_dump_vlab-03_20221012_230519.tar.gz