[FRR 8.2.2] BGP sessions are unable to establish due to interface IP not recognized #12380

Open
yxieca opened this issue Oct 12, 2022 · 6 comments

@yxieca
Contributor

yxieca commented Oct 12, 2022

Description

Recently, many PR tests for the master and 202205 branches have been failing because some BGP sessions were down after config reload or config load_minigraph, or right after deploying the minigraph.

The issue resembles FRRouting/frr#10404. However, the fix FRRouting/frr@2cf7651 did not resolve it.
(See experimental PR #12366.)

Steps to reproduce the issue:

  1. Create a PR against the master or 202205 branch.

Describe the results you received:

Tests fail after deploying the minigraph, or in any test case that involves config reload/load_minigraph.

Describe the results you expected:

BGP sessions always come up.

Output of show version:

SONiC Software Version: SONiC.master-12366.159811-344393d99
Distribution: Debian 11.5
Kernel: 5.10.0-12-2-amd64
Build commit: 344393d99
Build date: Wed Oct 12 19:16:01 UTC 2022
Built by: AzDevOps@sonic-build-workers-00284R

Output of show techsupport:

Oct 12 22:59:05.512453 vlab-03 ERR bgp#bgpd[50]: [VX6SM-8YE5W][EC 33554460] 10.0.0.33: nexthop_set failed, resetting connection - intf 0x0
Oct 12 22:59:05.512739 vlab-03 ERR bgp#bgpd[50]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 10.0.0.33, fd 68
Oct 12 22:59:05.647274 vlab-03 INFO bgp#bgpd[50]: [V1CHF-JSGRR] %NOTIFICATION: sent to neighbor 10.0.0.33 5/0 (Neighbor Events Error/Unspecific) 0 bytes 
Oct 12 22:59:10.005396 vlab-03 ERR bgp#bgpd[50]: [VX6SM-8YE5W][EC 33554460] 10.0.0.33: nexthop_set failed, resetting connection - intf 0x0
Oct 12 22:59:10.005558 vlab-03 ERR bgp#bgpd[50]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 10.0.0.33, fd 68
Oct 12 22:59:10.093347 vlab-03 INFO bgp#bgpd[50]: [V1CHF-JSGRR] %NOTIFICATION: sent to neighbor 10.0.0.33 5/0 (Neighbor Events Error/Unspecific) 0 bytes 
Oct 12 22:59:16.023509 vlab-03 ERR bgp#bgpd[50]: [VX6SM-8YE5W][EC 33554460] 10.0.0.33: nexthop_set failed, resetting connection - intf 0x0
Oct 12 22:59:16.023509 vlab-03 ERR bgp#bgpd[50]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 10.0.0.33, fd 68
Oct 12 22:59:16.023509 vlab-03 INFO bgp#bgpd[50]: [V1CHF-JSGRR] %NOTIFICATION: sent to neighbor 10.0.0.33 5/0 (Neighbor Events Error/Unspecific) 0 bytes 
Oct 12 22:59:20.029164 vlab-03 ERR bgp#bgpd[50]: [VX6SM-8YE5W][EC 33554460] 10.0.0.33: nexthop_set failed, resetting connection - intf 0x0
Oct 12 22:59:20.029350 vlab-03 ERR bgp#bgpd[50]: [NQGZV-Y3W62][EC 100663299] bgp_connect_success: bgp_getsockname(): failed for peer 10.0.0.33, fd 68
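
To gauge how often each peer hits this failure in a syslog capture, the lines above can be tallied per neighbor. The following is a minimal sketch, not part of the original report. The function name `count_nexthop_failures` is hypothetical, and it assumes the log format shown above, where the peer address immediately precedes `: nexthop_set failed`.

```shell
#!/bin/bash
# Hypothetical helper: tally "nexthop_set failed" errors per BGP peer.
# Reads syslog lines on stdin and prints one "<peer> <count>" line per peer.
count_nexthop_failures() {
    # Keep only the token right before ": nexthop_set failed",
    # then count how many times each peer appears.
    sed -n 's/.* \([^ ]*\): nexthop_set failed.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

It could be fed from a saved log, for example `count_nexthop_failures < syslog`, where `syslog` is whatever file holds the captured lines.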

Additional information you deem important (e.g. issue happens only occasionally):

sonic_dump_vlab-03_20221012_230519.tar.gz

@yxieca
Contributor Author

yxieca commented Oct 12, 2022

Alternative repro steps: download or build a VS image from a recent PR build that does not include PR #12381. #12381 seems to be a workaround for this issue that we might take for the short term.

Deploy a KVM T1 testbed, then run this bash script:

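#!/bin/bash -e

# Iteration counter; -i makes assignments evaluate as arithmetic.
declare -i i=0

while true; do
    i+=1
    echo "=== Iteration $i ==="
    sudo config reload -y &>/dev/null
    # Give the services time to restart and BGP time to converge.
    sleep 120
    bgp=$(show ip bgp sum)
    # Any neighbor still in Active or Idle state has not established.
    down=$(echo "${bgp}" | grep "Active\|Idle" || true)
    if [[ -n "${down}" ]]; then
        echo "=== Not all BGP sessions are up ==="
        show ip bgp sum
        break
    fi
    echo "=== Success. Sleep a bit before next iteration ==="
    sleep 120
done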
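
The `grep "Active\|Idle"` check in the script would miss a session stuck in another non-established state, such as `Connect`. A slightly stricter variant, sketched here as an assumption and not as part of the original repro, treats any neighbor whose State/PfxRcd column is not a number as down. It assumes the usual FRR-style summary layout, in which neighbor rows start with an IPv4 address and State/PfxRcd is the tenth whitespace-separated field. The function name `list_down_neighbors` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print BGP neighbors that are not Established.
# Reads BGP summary text on stdin. Assumes each neighbor row starts with an
# IPv4 address and carries State/PfxRcd in the 10th field; an Established
# session shows a prefix count (a number) there, anything else is a state name.
list_down_neighbors() {
    awk '$1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ && $10 !~ /^[0-9]+$/ { print $1, $10 }'
}
```

In the loop above it could replace the `grep` line, for example `down=$(show ip bgp sum | list_down_neighbors)`.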

@adyeung
Collaborator

adyeung commented Oct 12, 2022

@hasan-brcm will help take a look. @yxieca I'd suggest to backout #12381 first to unblock the build test while we are working on the RC

@yxieca
Contributor Author

yxieca commented Oct 13, 2022

> @hasan-brcm will help take a look. @yxieca I'd suggest to backout #12381 first to unblock the build test while we are working on the RC

Back out? Or are you suggesting we take it first? Just FYI, #12381 is not a bulletproof workaround, though it does reduce the chance of hitting the issue by at least 10x.

lguohan pushed a commit that referenced this issue Oct 13, 2022
Why I did it
The BGP service has always started after interface-config. Recently, however, we discovered an issue where some BGP sessions cannot establish because the BGP daemon is unable to read the interface IP.

This issue was clearly observed after upgrading to FRR 8.2.2. See more details in #12380.

How I did it
Delaying the start of BGP seems to work around this issue.

Note, however, that this delay might affect warm reboot timing and other timing sequences.

This workaround reduces the probability of hitting the issue by close to 100x. However, as testing shows, it is not bulletproof. It is still preferable to have a proper FRR fix and to revert this change in the future.

How to verify it
Continuously issue config reload and check the BGP session status afterwards.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
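
The commit above delays the BGP start. As a separate illustration that is not taken from that change, a delay can also be expressed as polling for a readiness condition with a bounded number of attempts. The sketch below is only an assumption about how such a wait might look. The names `wait_until` and `has_ipv4_addr` are hypothetical, and checking the kernel's address list says nothing about whether zebra or bgpd have learned the address, which is the actual failure reported here.

```shell
#!/bin/bash
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <interval_seconds> <command> [args...]
# Returns 0 on the first success, 1 if every attempt failed.
wait_until() {
    local attempts=$1 interval=$2
    shift 2
    local n
    for ((n = 1; n <= attempts; n++)); do
        # Stop as soon as the readiness check passes.
        if "$@"; then
            return 0
        fi
        sleep "$interval"
    done
    return 1
}

# Hypothetical readiness check: does the kernel list this IPv4 address?
# Usage: has_ipv4_addr <address-without-prefix-length>
has_ipv4_addr() {
    # -F matches the address literally, so dots are not regex wildcards.
    ip -o -4 addr show | grep -qF " inet $1/"
}
```

For example, `wait_until 30 2 has_ipv4_addr 10.0.0.32` would poll for up to about a minute, where `10.0.0.32` is an assumed local address chosen only for illustration.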
yxieca added a commit that referenced this issue Oct 13, 2022
@yxieca
Contributor Author

yxieca commented Oct 13, 2022

@adyeung @hasan-brcm please be advised that we've merged #12381. If you want to test a fix, it would be better to revert #12381 in your test.

@hasan-brcm
Contributor

hasan-brcm commented Oct 13, 2022 via email

yxieca added a commit that referenced this issue Oct 16, 2022
…interrupted (#12412)

Why I did it
There is an outstanding FRR issue #12380. This seems to be a known issue without a good fix so far. The root cause involves the interaction between zebra and kernel netlink. Previously, zebra did not notice the failure.

How I did it
Port the patch that would make the issue obvious.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
yxieca added a commit that referenced this issue Oct 25, 2022
@StormLiangMS
Contributor

@hasan-brcm I noticed one thing while debugging another issue: on bullseye, network.service is restarted several times during bootup. I'm not sure whether that leads to the missing IP address issue. With the old kernel, network.service is started very early during system bootup. I haven't figured out why network.service is restarted several times during bootup.

StormLiangMS pushed a commit to sonic-net/sonic-mgmt that referenced this issue Aug 14, 2024
What is the motivation for this PR?
FRR issue sonic-net/sonic-buildimage#12380
Verify the bgp sessions' status

How did you do it?
1: Check that all BGP sessions are up.
2: Inject a failure: shut down the fanout physical interface or the neighbor port.
4: Run the test: reset BGP or swss, or reboot.
5: Verify that all BGP sessions are up.
How did you verify/test it?
Run the test case

Any platform specific information?
Supported testbed topology if it's a new test case?
T0,T1
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this issue Aug 14, 2024
mssonicbld pushed a commit to sonic-net/sonic-mgmt that referenced this issue Aug 14, 2024
arista-hpandya pushed a commit to arista-hpandya/sonic-mgmt that referenced this issue Oct 2, 2024