[fast-reboot] Add a check for warmstart before cleaning up neigh table #1498

Merged (1 commit) on Nov 18, 2020

bingwang-ms

What I did
Fix sonic-net/sonic-buildimage#5841 and sonic-net/sonic-buildimage#5580

We found that the neighbor table loaded by swssconfig from arp.json after fast-reboot is mistakenly cleared by neighsyncd during its initial stage. This PR adds a WarmStart check before the cleanup, so the table is cleared only if WarmStart is enabled.

Why I did it
This PR fixes the issue that the ARP table is not recovered after fast-reboot.

How I verified it
Verified on Arista-7260, running 201911 image.

  1. Run a test that populates ARP entries on the DUT, such as test_fast_reboot
  2. Issue a fast-reboot
  3. Verify the arp.json backed up by fast-reboot-dump.py is loaded and NEIGH_TABLE is restored.
admin@str-7260cx3-acs-2:~$ redis-cli -n 0 keys '*NEIGH_TABLE*'
  1) "NEIGH_TABLE:Vlan1000:192.168.0.231"
  2) "NEIGH_TABLE:Vlan1000:192.168.0.106"
  3) "NEIGH_TABLE:Vlan1000:192.168.1.102"
  ....
509) "NEIGH_TABLE:Vlan1000:192.168.0.47"
510) "NEIGH_TABLE:Vlan1000:192.168.1.92"
py.test --inventory ../ansible/str,../ansible/veos --host-pattern str-7260cx3-acs-2 --module-path ../ansible --testbed vms7-t0-7260-2 --testbed_file ../ansible/testbed.csv --junit-xml=tr.xml --log-cli-level info --collect_techsupport=False --topology=t0,any,util platform_tests/test_advanced_reboot.py::test_fast_reboot
========================================================================================= test session starts =========================================================================================                                                                                                                        
platform_tests/test_advanced_reboot.py::test_fast_reboot 
------------------------------------------------------------------------------------------- live log setup --------------------------------------------------------------------------------------------
PASSED                                                                                                                                                                                          [100%]
------------------------------------------------------------------------------------------ live log teardown ------------------------------------------------------------------------------------------
===================================================================================== 1 passed in 404.24 seconds ======================================================================================
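When checking the restored entries, it can help to split each key from the `redis-cli` output into its interface and IP address. The following is a minimal sketch, assuming the `NEIGH_TABLE:<interface>:<ip>` key layout shown in the output above; `NeighKey` and `parseNeighKey` are hypothetical names invented for this illustration and are not part of SONiC.

```cpp
#include <string>

// Hypothetical helper types, invented for this sketch (not SONiC code).
struct NeighKey
{
    std::string ifname;  // e.g. "Vlan1000"
    std::string ip;      // e.g. "192.168.0.231"
};

// Parse "NEIGH_TABLE:<ifname>:<ip>". Only the first colon after the prefix is
// treated as a separator, so an IPv6 address (which itself contains colons)
// stays in one piece. Returns false if the key does not match the layout.
bool parseNeighKey(const std::string &key, NeighKey &out)
{
    const std::string prefix = "NEIGH_TABLE:";
    if (key.compare(0, prefix.size(), prefix) != 0)
        return false;

    const std::string::size_type sep = key.find(':', prefix.size());
    if (sep == std::string::npos || sep == prefix.size() || sep + 1 == key.size())
        return false;  // missing interface or missing address

    out.ifname = key.substr(prefix.size(), sep - prefix.size());
    out.ip = key.substr(sep + 1);
    return true;
}
```

The function expects the bare key, without the numbering and quotes that `redis-cli` adds around each result.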

Details if related

…s enable.

This commit is to address the issue that the NEIGH_TABLE loaded by swssconfig
after fast-reboot is cleared by neighsyncd.

Signed-off-by: bingwang <wang.bing@microsoft.com>
-    psTable->clear();
+    if (m_warmStartInProgress)
+    {
+        psTable->clear();

Looks like this change will affect both neighsyncd and natsyncd. @lguohan is it OK to not clear the table for natsyncd when the DUT is warm-rebooting? If not, and to limit the change to neighsyncd, we can also check that the table name belongs to neighsyncd.

Wouldn't NAT hit the same issue if this protection is not there?

It is not clear whether the NAT tables need to be cleared unconditionally. This needs a bit of digging into how NAT uses this shared class. NAT uses this method for 4 tables here.

This is a bug introduced by #1126, where the refactoring of the library to support multiple tables called psTable->clear unconditionally. The fix as done here should be correct. NAT tables, or any other client, should not use the library to flush the producer state table in non-warm-reboot cases.

Btw: the reason we used psTable clear was to make sure the relevant table wouldn't change in some corner cases after we dumped it to memory; this is required only in the warm-reboot/restart case, before we dump the table. Also, since the daemon using the library was the producer itself, it was safe to do so. However, this assumption breaks if swssconfig loads the table at the same time, which is the case in non-warm-reboot scenarios for the ARP and NAT tables, etc. In those cases, the library incorrectly cleared the requests from swssconfig and caused the issues reported.
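The interaction described in this comment can be sketched with an in-memory stand-in for the producer state table. This is only an illustration under stated assumptions: `MockProducerTable`, `startupBeforeFix`, `startupAfterFix`, and `fastRebootSurvivors` are hypothetical names, the entry values are placeholders, and none of this is the actual library code.

```cpp
#include <cstddef>
#include <map>
#include <string>

// In-memory stand-in for a producer state table holding pending entries.
// This is a mock for illustration, not the real swss class.
class MockProducerTable
{
public:
    void set(const std::string &key, const std::string &value) { m_pending[key] = value; }
    void clear() { m_pending.clear(); }
    std::size_t size() const { return m_pending.size(); }

private:
    std::map<std::string, std::string> m_pending;
};

// Behavior before the fix (hypothetical function name): the table is
// flushed regardless of the reboot type.
void startupBeforeFix(MockProducerTable &table, bool /*warmStartInProgress*/)
{
    table.clear();
}

// Behavior after the fix: flush only while a warm start is in progress.
void startupAfterFix(MockProducerTable &table, bool warmStartInProgress)
{
    if (warmStartInProgress)
    {
        table.clear();
    }
}

// Simulate a fast-reboot: entries are loaded first (as swssconfig does from
// arp.json), then the daemon's start-up step runs with warm start disabled.
// Returns how many loaded entries are still pending afterwards.
std::size_t fastRebootSurvivors(void (*startup)(MockProducerTable &, bool))
{
    MockProducerTable table;
    table.set("Vlan1000:192.168.0.231", "placeholder-a");
    table.set("Vlan1000:192.168.0.106", "placeholder-b");
    startup(table, /*warmStartInProgress=*/false);
    return table.size();
}
```

With the unconditional flush, the entries loaded before start-up are lost; with the guard, they survive a fast-reboot and are still flushed on a warm start.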

@qiluo-msft qiluo-msft merged commit fcb6c9d into sonic-net:master Nov 18, 2020
abdosi pushed a commit that referenced this pull request Dec 4, 2020
daall pushed a commit to daall/sonic-swss that referenced this pull request Dec 7, 2020
Successfully merging this pull request may close these issues.

[fast-reboot] arp table will be cleared after swssconfig restores it