Chassis: Orchagent crashes are seen in Voq chassis while running sonic-mgmt PC and voq suites #20507

Closed
saksarav-nokia opened this issue Oct 15, 2024 · 6 comments

Comments

@saksarav-nokia
Contributor

Description

With PR sonic-net/sonic-swss#3269, the orchagent crashes are seen while running sonic-mgmt PC and Voq suites.

Steps to reproduce the issue:

  1. Run PC and Voq suites with latest master

Describe the results you received:

Orchagent crashed multiple times

Describe the results you expected:

No crashes

Output of show version:

(paste your output here)

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

@saksarav-nokia
Contributor Author

Already discussed the issue and the fix with @arlakshm and @abdosi. Testing the fix.

@vdahiya12 added the Triaged label on Oct 23, 2024
@kenneth-arista
Contributor

@saksarav-nokia can you paste the crash backtrace into this issue? We suspect that you may be encountering a backtrace similar to the one documented in #20605.

@saksarav-nokia
Contributor Author

@kenneth-arista
2024 Oct 28 19:45:49.253094 ixre-egl-board1 NOTICE syncd1#syncd: [07:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 15 if_id 536923493
2024 Oct 28 19:45:49.253400 ixre-egl-board1 NOTICE syncd0#syncd: [06:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 38 if_id 536923495
2024 Oct 28 19:45:49.253738 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed next hop 3.3.3.31 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.254133 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed next hop 3.3.3.31 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.254564 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.255134 ixre-egl-board1 NOTICE syncd1#syncd: [07:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 16 if_id 536923500
2024 Oct 28 19:45:49.255147 ixre-egl-board1 NOTICE swss1#nbrmgrd: :- delKernelRoute: IPv4 Route Del cmd: /sbin/ip route del 3.3.3.31/32
2024 Oct 28 19:45:49.255440 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed next hop 3333::3:19 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.255519 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.256189 ixre-egl-board1 NOTICE swss0#nbrmgrd: :- delKernelRoute: IPv4 Route Del cmd: /sbin/ip route del 3.3.3.31/32
2024 Oct 28 19:45:49.256231 ixre-egl-board1 NOTICE syncd0#syncd: [06:00.0] SAI_API_NEXT_HOP:brcm_sai_remove_next_hop:441 Removing nhid 39 if_id 536923502
2024 Oct 28 19:45:49.256231 ixre-egl-board1 NOTICE swss1#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.256375 ixre-egl-board1 ERR swss1#orchagent: :- meta_generic_validation_remove: object 0x104000000002165 reference count is 672, can't remove
2024 Oct 28 19:45:49.256444 ixre-egl-board1 ERR swss1#orchagent: :- removeNeighbor: Failed to remove next hop 10.0.0.163 on ixre-egl-board27|asic0|Ethernet120, rv:-17
2024 Oct 28 19:45:49.256502 ixre-egl-board1 ERR swss1#orchagent: :- handleSaiRemoveStatus: Encountered failure in remove operation, exiting orchagent, SAI API: SAI_API_NEXT_HOP, status: SAI_STATUS_OBJECT_IN_USE
2024 Oct 28 19:45:49.256527 ixre-egl-board1 NOTICE swss1#orchagent: :- notifySyncd: sending syncd: SYNCD_INVOKE_DUMP
2024 Oct 28 19:45:49.256751 ixre-egl-board1 NOTICE syncd1#syncd: :- processNotifySyncd: Invoking SAI failure dump
2024 Oct 28 19:45:49.256934 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed next hop 3333::3:19 on ixre-egl-board27|asic0|Ethernet-IB0
2024 Oct 28 19:45:49.257924 ixre-egl-board1 NOTICE swss0#orchagent: :- removeNeighbor: Removed neighbor 40:7c:7d:bb:25:ab on ixre-egl-board27|asic0|Ethernet-IB0

@arlakshm
Contributor

@saksarav-nokia has a PR to fix this.

@saksarav-nokia
Contributor Author

The IMM has two ASICs, with 2 port channels in each ASIC and 2 port members in each port channel.
An IP address is configured on each port channel and BGP is enabled. The neighbors and routes are learned on these port channels.
In the sonic-mgmt PC suite, the test case po-update removes the port members from one of the port channels, removes the IP address configured on that port channel, creates a new port channel, adds the same port members to the new port channel, and adds the same IP address to the new port channel.
In the remote ASIC, orchagent attempts to remove the neighbor and nexthop for the old port channel before routeOrch has removed all the routes learned on it. Since routes are still pending, the old nexthop and neighbor are not removed. Then the neighbor and nexthop for the new port channel are added.
If the neighbor is learned on a remote system port in the remote ASIC, the nexthop is added with the inband port's alias as its alias, so the key (ip, alias) is the same for both the old nexthop and the new nexthop. When the new nexthop is added, hasNextHop is called to check whether a nexthop with the key (ip-address, alias) already exists. Since the old nexthop has not been removed yet, hasNextHop returns true, but the assert(!hasNextHop) does not trigger a crash. So the addNextHop function replaces the old nexthop (with the old rif-id) with the new nexthop (with the new rif-id) in the nexthop map.
After all the routes learned on the old port channel are removed, the old neighbor and old nexthop are removed. Since the old nexthop was replaced with the new nexthop in the map, when orchagent tries to delete the old nexthop, it actually deletes the new nexthop from SAI. Then, when it tries to remove the old neighbor, SAI returns an error, because orchagent removed the new nexthop from SAI instead of the old nexthop and the old neighbor is still referenced by the old nexthop in SAI. Orchagent crashes when SAI returns the error.
The same issue is seen when a config reload is done on the remote IMM, and sometimes even with a reboot.
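
To make the sequence above easier to follow, here is a minimal toy model of the key collision in Python. It is only a sketch, not the orchagent code: `ToySai`, `ToyNeighOrch`, `replay`, the `guard` flag, and the IP address and neighbor names are all hypothetical and exist only for illustration. The `guard` option shows one possible way to avoid the overwrite (refusing to add while the key is still present); it is not meant to describe what sonic-net/sonic-swss#3329 actually changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NextHopKey:
    """Toy (ip, alias) key; on a remote ASIC the alias is the inband port's alias."""
    ip: str
    alias: str


class ToySai:
    """Hypothetical stand-in for SAI: tracks next hops and the neighbor each references."""

    def __init__(self):
        self.next_hops = {}  # nh_id -> neighbor id referenced by that next hop
        self._next_id = 1

    def create_next_hop(self, nbr_id):
        nh_id = self._next_id
        self._next_id += 1
        self.next_hops[nh_id] = nbr_id
        return nh_id

    def remove_next_hop(self, nh_id):
        del self.next_hops[nh_id]

    def remove_neighbor(self, nbr_id):
        # A neighbor that is still referenced by a next hop cannot be removed.
        if nbr_id in self.next_hops.values():
            return "SAI_STATUS_OBJECT_IN_USE"
        return "SAI_STATUS_SUCCESS"


class ToyNeighOrch:
    """Hypothetical model of the nexthop map keyed by (ip, alias)."""

    def __init__(self, sai, guard=False):
        self.sai = sai
        self.guard = guard
        self.next_hops = {}  # NextHopKey -> nh_id

    def has_next_hop(self, key):
        return key in self.next_hops

    def add_next_hop(self, key, nbr_id):
        if self.guard and self.has_next_hop(key):
            return False  # illustrative guard: leave the old entry alone
        # Without the guard, an existing entry for the same key is overwritten.
        self.next_hops[key] = self.sai.create_next_hop(nbr_id)
        return True

    def remove_next_hop(self, key):
        self.sai.remove_next_hop(self.next_hops.pop(key))


def replay(guard):
    """Replay the po-update sequence as seen from the remote ASIC."""
    sai = ToySai()
    orch = ToyNeighOrch(sai, guard)
    key = NextHopKey("10.0.0.1", "Ethernet-IB0")  # illustrative values

    orch.add_next_hop(key, "nbr-on-old-rif")
    # Removal of the old next hop is deferred because routes are still pending,
    # and the next hop for the new port channel arrives with the same key.
    added = orch.add_next_hop(key, "nbr-on-new-rif")
    # Routes are gone now; the deferred removal of the "old" next hop runs.
    orch.remove_next_hop(key)
    return added, sai.remove_neighbor("nbr-on-old-rif")
```

Calling `replay(False)` walks the failing path: the map entry is overwritten, the deferred removal deletes the new next hop, and removing the old neighbor fails because the old next hop still references it. Calling `replay(True)` shows that the old next hop is removed cleanly when the second add is rejected.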

@saksarav-nokia
Contributor Author

Fixed by sonic-net/sonic-swss#3329
