Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL #2619

Merged
merged 2 commits into from
Jan 19, 2023

Conversation

stephenxs
Copy link
Collaborator

@stephenxs stephenxs commented Jan 13, 2023

What I did

Fix issue sonic-net/sonic-buildimage#13341 the issue that ARP entry is out of sync between kernel and APPL_DB

In AppRestartAssist::insertToMap, in case an entry has been updated more than once with the same value but different from the stored one, keep the state as NEW.

Eg.
Assume the entry's value that is restored from the warm reboot is V0, the following events received.

  1. The first update with value V1 is received and handled by the if (found != appTableCacheMap[tableName].end()) branch,
    • the state is set to NEW
    • value is updated to V1
  2. The second update with the same value V1 is received and handled by this branch
    • Originally, the state was set to SAME, which is wrong because V1 is different from the stored value V0
    • The correct logic should be: set the state to SAME only if the state is not NEW

This is a very rare case because most of times, the entry won't be updated multiple times

Signed-off-by: Stephen Sun stephens@nvidia.com

Why I did it

To fix the issue.

How I verified it

Mock test is added to cover the case.

Manually test. with the fix in, the log message indicates the new ARP entry has been applied to APPL_DB (and below)

Jan 12 13:06:18.800339 mtbc-sonic-01-2410 NOTICE swss#neighsyncd: :- isNeighRestoreDone: neighbor table restore to kernel is done
Jan 12 13:06:18.804322 mtbc-sonic-01-2410 INFO swss#supervisord: neighsyncd Listens to neigh messages...
Jan 12 13:06:18.806320 mtbc-sonic-01-2410 INFO swss#neighsyncd: :- insertToMap: Received message NEIGH_TABLE, key: Vlan1000:192.168.0.2, neigh:24:8a:07:ab:a1:12, family:IPv4, , delete = 0
Jan 12 13:06:18.806647 mtbc-sonic-01-2410 NOTICE swss#neighsyncd: :- insertToMap: NEIGH_TABLE, found key: Vlan1000:192.168.0.2, new value 
Jan 12 13:06:21.684996 mtbc-sonic-01-2410 INFO swss#neighsyncd: :- insertToMap: Received message NEIGH_TABLE, key: Vlan1000:192.168.0.2, neigh:24:8a:07:ab:a1:12, family:IPv4, , delete = 0
Jan 12 13:06:21.684996 mtbc-sonic-01-2410 NOTICE swss#neighsyncd: :- insertToMap: NEIGH_TABLE, found key: Vlan1000:192.168.0.2, it has been updated for the second time, keep state as NEW
Jan 12 13:06:21.684996 mtbc-sonic-01-2410 INFO swss#neighsyncd: :- insertToMap: Received message NEIGH_TABLE, key: Vlan1000:192.168.0.2, neigh:24:8a:07:ab:a1:12, family:IPv4, , delete = 0
Jan 12 13:06:21.684996 mtbc-sonic-01-2410 NOTICE swss#neighsyncd: :- insertToMap: NEIGH_TABLE, found key: Vlan1000:192.168.0.2, it has been updated for the second time, keep state as NEW
Jan 12 13:06:23.800836 mtbc-sonic-01-2410 INFO swss#neighsyncd: :- checkReconcileTimer: warmstart timer expired
Jan 12 13:06:23.800836 mtbc-sonic-01-2410 NOTICE swss#neighsyncd: :- reconcile: NEIGH_TABLE NEW, key: Vlan1000:192.168.0.2, neigh:24:8a:07:ab:a1:12, family:IPv4, cache-state:NEW, 
Jan 12 13:06:23.802468 mtbc-sonic-01-2410 NOTICE swss#neighsyncd: :- setWarmStartState: neighsyncd warm start state changed to reconciled

Details if related

Signed-off-by: Stephen Sun <stephens@nvidia.com>
Signed-off-by: Stephen Sun <stephens@nvidia.com>
@stephenxs stephenxs marked this pull request as ready for review January 14, 2023 09:21
@stephenxs stephenxs requested a review from prsunny as a code owner January 14, 2023 09:21
@stephenxs stephenxs changed the title Fix the issue that ARP entry is out of sync between kernel and APPL_DB Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL Jan 15, 2023
Copy link
Collaborator

@prsunny prsunny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, request review from @yxieca

@yxieca
Copy link
Contributor

yxieca commented Jan 18, 2023

@prsunny do we need this change in 202205?

@liat-grozovik liat-grozovik merged commit cd95972 into sonic-net:master Jan 19, 2023
@prsunny
Copy link
Collaborator

prsunny commented Jan 19, 2023

@prsunny do we need this change in 202205?

i dont think so

@stephenxs stephenxs deleted the fix-double-update-wra branch January 20, 2023 01:29
dprital added a commit to dprital/sonic-buildimage that referenced this pull request Jan 30, 2023
Update sonic-swss submodule pointer to include the following:
* a2a483d [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  ([sonic-net#2617](sonic-net/sonic-swss#2617))
* 9d1f66b [bfdorch] add local discriminator to state DB ([sonic-net#2629](sonic-net/sonic-swss#2629))
* c54b3d1 Vxlan tunnel endpoint custom monitoring APPL DB table. ([sonic-net#2589](sonic-net/sonic-swss#2589))
* 7f03db2 Fix potential risks ([sonic-net#2516](sonic-net/sonic-swss#2516))
* 383ee68 [refactor]Refactoring sai handle status ([sonic-net#2621](sonic-net/sonic-swss#2621))
* cd95972 Fix issue 13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL ([sonic-net#2619](sonic-net/sonic-swss#2619))
* a01470f Remove TODO comments that are no longer relevant ([sonic-net#2622](sonic-net/sonic-swss#2622))
* d058390 Changed the BFD default detect multiplier to 10x ([sonic-net#2614](sonic-net/sonic-swss#2614))
* d78b528 [MuxOrch] Enabling neighbor when adding in active state ([sonic-net#2601](sonic-net/sonic-swss#2601))
* 4ebdad1 [routesync] Fix for stale dynamic neighbor ([sonic-net#2553](sonic-net/sonic-swss#2553))
* 8857f92 Added new attributes for Vnet and Vxlan ecmp configurations. ([sonic-net#2584](sonic-net/sonic-swss#2584))
* b6bbc3e Revert [voq][chassis]Add show fabric counters port/queue commands (2522) ([sonic-net#2611](sonic-net/sonic-swss#2611))
* 52406e2 Add missing parameter to on_switch_shutdown_request method. ([sonic-net#2567](sonic-net/sonic-swss#2567))
* 4ac9ad9 Increase diff coverage to 80% ([sonic-net#2599](sonic-net/sonic-swss#2599))
* 8a0bb36 Handle Mac address 'none' ([sonic-net#2593](sonic-net/sonic-swss#2593))
* f496ab3 [vstest] Only collect stdout of orchagent_restart_check in vstest ([sonic-net#2597](sonic-net/sonic-swss#2597))
* 1dab495 Avoid aborting orchagent when setting TUNNEL attributes ([sonic-net#2591](sonic-net/sonic-swss#2591))
* 4395cea Fix neighbor doesn't update all attribute ([sonic-net#2577](sonic-net/sonic-swss#2577))

Signed-off-by: dprital <drorp@nvidia.com>
liat-grozovik pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Feb 1, 2023
Update sonic-swss submodule pointer to include the following:
* a2a483d [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  ([#2617](sonic-net/sonic-swss#2617))
* 9d1f66b [bfdorch] add local discriminator to state DB ([#2629](sonic-net/sonic-swss#2629))
* c54b3d1 Vxlan tunnel endpoint custom monitoring APPL DB table. ([#2589](sonic-net/sonic-swss#2589))
* 7f03db2 Fix potential risks ([#2516](sonic-net/sonic-swss#2516))
* 383ee68 [refactor]Refactoring sai handle status ([#2621](sonic-net/sonic-swss#2621))
* cd95972 Fix issue 13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL ([#2619](sonic-net/sonic-swss#2619))
* a01470f Remove TODO comments that are no longer relevant ([#2622](sonic-net/sonic-swss#2622))
* d058390 Changed the BFD default detect multiplier to 10x ([#2614](sonic-net/sonic-swss#2614))
* d78b528 [MuxOrch] Enabling neighbor when adding in active state ([#2601](sonic-net/sonic-swss#2601))
* 4ebdad1 [routesync] Fix for stale dynamic neighbor ([#2553](sonic-net/sonic-swss#2553))
* 8857f92 Added new attributes for Vnet and Vxlan ecmp configurations. ([#2584](sonic-net/sonic-swss#2584))
* b6bbc3e Revert [voq][chassis]Add show fabric counters port/queue commands (2522) ([#2611](sonic-net/sonic-swss#2611))
* 52406e2 Add missing parameter to on_switch_shutdown_request method. ([#2567](sonic-net/sonic-swss#2567))
* 4ac9ad9 Increase diff coverage to 80% ([#2599](sonic-net/sonic-swss#2599))
* 8a0bb36 Handle Mac address 'none' ([#2593](sonic-net/sonic-swss#2593))
* f496ab3 [vstest] Only collect stdout of orchagent_restart_check in vstest ([#2597](sonic-net/sonic-swss#2597))
* 1dab495 Avoid aborting orchagent when setting TUNNEL attributes ([#2591](sonic-net/sonic-swss#2591))
* 4395cea Fix neighbor doesn't update all attribute ([#2577](sonic-net/sonic-swss#2577))

Signed-off-by: dprital <drorp@nvidia.com>
StormLiangMS pushed a commit that referenced this pull request Feb 10, 2023
…_DB if multiple updates are received from RTNL (#2619)

- What I did
Fix issue sonic-net/sonic-buildimage#13341 the issue that ARP entry is out of sync between kernel and APPL_DB
In AppRestartAssist::insertToMap, in case an entry has been updated more than once with the same value but different from the stored one, keep the state as NEW.

Eg.
Assume the entry's value that is restored from the warm reboot is V0, the following events received.

The first update with value V1 is received and handled by the if (found != appTableCacheMap[tableName].end()) branch,
* the state is set to NEW
* value is updated to V1
The second update with the same value V1 is received and handled by this branch
Originally, the state was set to SAME, which is wrong because V1 is different from the stored value V0
The correct logic should be: set the state to SAME only if the state is not NEW
This is a very rare case because most of times, the entry won't be updated multiple times

- Why I did it
To fix the issue.

- How I verified it
Mock test is added to cover the case.

Signed-off-by: Stephen Sun <stephens@nvidia.com>
prsunny pushed a commit that referenced this pull request Feb 16, 2023
*Merge remote-tracking branch 'upstream/master' into dash (#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (#2591)
* Handle Mac address 'none' (#2593)
* Increase diff coverage to 80% (#2599)
* Add missing parameter to on_switch_shutdown_request method. (#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (#2522)" (#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (#2553)
* [MuxOrch] Enabling neighbor when adding in active state (#2601)
* Changed the BFD default detect multiplier to 10x (#2614)
* Remove TODO comments that are no longer relevant (#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (#2619)
* [refactor]Refactoring sai handle status (#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (#2617)
* [voq][chassis] Remove created ports from the default vlan. (#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (#2638)
* [hash]: Add UT infra. (#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (#2644)
* [ResponsePublisher] add pipeline support  (#2511)
* [dash] Fix compilation issue caused by missing include.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 21, 2023
)

*Merge remote-tracking branch 'upstream/master' into dash (sonic-net#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (sonic-net#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (sonic-net#2591)
* Handle Mac address 'none' (sonic-net#2593)
* Increase diff coverage to 80% (sonic-net#2599)
* Add missing parameter to on_switch_shutdown_request method. (sonic-net#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (sonic-net#2522)" (sonic-net#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (sonic-net#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (sonic-net#2553)
* [MuxOrch] Enabling neighbor when adding in active state (sonic-net#2601)
* Changed the BFD default detect multiplier to 10x (sonic-net#2614)
* Remove TODO comments that are no longer relevant (sonic-net#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (sonic-net#2619)
* [refactor]Refactoring sai handle status (sonic-net#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (sonic-net#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (sonic-net#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (sonic-net#2617)
* [voq][chassis] Remove created ports from the default vlan. (sonic-net#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (sonic-net#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (sonic-net#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (sonic-net#2638)
* [hash]: Add UT infra. (sonic-net#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (sonic-net#2644)
* [ResponsePublisher] add pipeline support  (sonic-net#2511)
* [dash] Fix compilation issue caused by missing include.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 25, 2023
)

*Merge remote-tracking branch 'upstream/master' into dash (sonic-net#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (sonic-net#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (sonic-net#2591)
* Handle Mac address 'none' (sonic-net#2593)
* Increase diff coverage to 80% (sonic-net#2599)
* Add missing parameter to on_switch_shutdown_request method. (sonic-net#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (sonic-net#2522)" (sonic-net#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (sonic-net#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (sonic-net#2553)
* [MuxOrch] Enabling neighbor when adding in active state (sonic-net#2601)
* Changed the BFD default detect multiplier to 10x (sonic-net#2614)
* Remove TODO comments that are no longer relevant (sonic-net#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (sonic-net#2619)
* [refactor]Refactoring sai handle status (sonic-net#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (sonic-net#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (sonic-net#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (sonic-net#2617)
* [voq][chassis] Remove created ports from the default vlan. (sonic-net#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (sonic-net#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (sonic-net#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (sonic-net#2638)
* [hash]: Add UT infra. (sonic-net#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (sonic-net#2644)
* [ResponsePublisher] add pipeline support  (sonic-net#2511)
* [dash] Fix compilation issue caused by missing include.
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 25, 2023
)

*Merge remote-tracking branch 'upstream/master' into dash (sonic-net#2663)
* Modify coppmgr mergeConfig to support preserving copp tables through reboot. (sonic-net#2548)
* Avoid aborting orchagent when setting TUNNEL attributes (sonic-net#2591)
* Handle Mac address 'none' (sonic-net#2593)
* Increase diff coverage to 80% (sonic-net#2599)
* Add missing parameter to on_switch_shutdown_request method. (sonic-net#2567)
* Add ZMQ based ProducerStateTable and CustomerStateTable.
* Revert "[voq][chassis]Add show fabric counters port/queue commands (sonic-net#2522)" (sonic-net#2611)
* Added new attributes for Vnet and Vxlan ecmp configurations. (sonic-net#2584)
* added support for monitoring, primary and adv_prefix and overlay_dmac.
* [routesync] Fix for stale dynamic neighbor (sonic-net#2553)
* [MuxOrch] Enabling neighbor when adding in active state (sonic-net#2601)
* Changed the BFD default detect multiplier to 10x (sonic-net#2614)
* Remove TODO comments that are no longer relevant (sonic-net#2622)
* Fix issue #13341 ARP entry can be out of sync between kernel and APPL_DB if multiple updates are received from RTNL (sonic-net#2619)
* [refactor]Refactoring sai handle status (sonic-net#2621)
* Vxlan tunnel endpoint custom monitoring APPL DB table. (sonic-net#2589)
* added support for monitoring, primary and adv_prefix. changed filter_mac to overlay_dmac
* Data Structures and code to write APP_DB VNET_MONITOR table entries for custom monitoring of Vxlan tunnel endpoints.
* [bfdorch] add local discriminator to state DB (sonic-net#2629)
* [acl] Add new ACL key BTH_OPCODE and AETH_SYNDROME  (sonic-net#2617)
* [voq][chassis] Remove created ports from the default vlan. (sonic-net#2607)
* [EVPN]Handling race condition when remote VNI arrives before tunnel map entry (sonic-net#2642)
*Added check in remote VNI add to ensure vxlan tunnel map is created before adding the remote end point.
* [test_mux] add sleep in test_NH (sonic-net#2648)
* [autoneg]Fixing adv interface types to be set when AN is disabled (sonic-net#2638)
* [hash]: Add UT infra. (sonic-net#2660)
*Added UT infra for Generic Hash feature
*Aligned PBH tests with Generic Hash UT infra
* [sai_failure_dump]Invoking dump during SAI failure (sonic-net#2644)
* [ResponsePublisher] add pipeline support  (sonic-net#2511)
* [dash] Fix compilation issue caused by missing include.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants