
LAG keepalive script to reduce lacp session wait during warm-reboot #2806

Merged
vaibhavhd merged 3 commits into sonic-net:master on May 4, 2023

Conversation

@vaibhavhd (Contributor):

What I did

A new mechanism is added here to reduce the LAG flap issue during hitless upgrades.

Problem being solved:

During a warm upgrade the T0 goes down, and with that the wait time for the LACP session starts.
If the wait time to refresh the LACP session exceeds 90s, the T1 initiates LAG teardown, and as a result dataplane impact is seen.
This script makes sure that LACPDUs keep being sent continuously along the shutdown path.

How time is saved w/ this mechanism:

The LACP session wait period previously started when the teamd container went down.
The new LACP session wait period starts only when kexec is issued in the current kernel and the new kernel boots up.

Implementation:

When warm-reboot starts, capture the LACPDUs sent from all LAG member ports; allow 60s of prep + collection time for this.
Start sending LACPDUs at a ~1s interval.
The last LACPDU is sent after all containers are down and kexec is issued. (A rough sketch of the idea is shown below.)
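
For illustration, a minimal sketch of the idea, not the exact lag_keepalive.py merged in this PR; the interface names, BPF filter string, and timeouts below are placeholders:

```python
# Sketch of the keepalive mechanism: capture one LACPDU per LAG member port
# while teamd is still up, then replay the captured frames every ~1s until the
# process is killed (i.e. at kexec time).
import time

from scapy.config import conf
conf.ipv6_enabled = False
from scapy.all import sendp, sniff

SLOW_PROTO_FILTER = "ether proto 0x8809"        # LACP uses the slow-protocols EtherType
LAG_MEMBER_PORTS = ["Ethernet0", "Ethernet4"]   # placeholder: read from CONFIG_DB in practice


def capture_lacpdus(ports):
    """Capture one outgoing LACPDU per LAG member port while teamd is still running."""
    captured = {}
    for port in ports:
        # worst case ~30s: LACP slow rate sends one LACPDU every 30s
        pkts = sniff(iface=port, filter=SLOW_PROTO_FILTER, count=1, timeout=30)
        if pkts:
            captured[port] = pkts[0]
    return captured


def lag_keepalive(captured):
    """Replay the captured LACPDUs at ~1s intervals until the process is killed."""
    while True:
        for port, lacpdu in captured.items():
            sendp(lacpdu, iface=port, verbose=False)
        time.sleep(1)


if __name__ == "__main__":
    lag_keepalive(capture_lacpdus(LAG_MEMBER_PORTS))
```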

Results:

Tested this on different platforms and images. Some results for time saved:

BRCM: 201811 -> 202012 --- 18s
BRCM: 202012 -> 202012 --- 20s
MLNX: 201911 -> 202205 --- 10s


@vaibhavhd (Contributor, Author):

@stepanblyschak please review. You might find this interesting, since you are also focused on reducing the LACP session wait period during warm reboot. This is a little hacky, but it takes advantage of the fact that packets can still be sent even when teamd and syncd are down, as the physical ports are still up and sockets can still function.

It would help if you could also test this in your setup and share the time saved in your tests.
This new mechanism saves ~10s on the Nvidia platform where I tested.

@stepanblyschak (Contributor):

@vaibhavhd This is an interesting finding. The original motivation for stopping teamd prior to syncd was to have the ability to send a last LACPDU before syncd removes the netdevs from the kernel. Since, in your observation, the kernel netdevs live until kexec, I believe this is no longer true. At least for Nvidia it is not true since sonic-net/sonic-buildimage#2572. But I don't know whether other vendors' syncd stop removes the netdevs or not. Can we swap the teamd/syncd shutdown order to achieve similar results?

BTW, a bit more general question: I know you're using the "control plane assistant" to reply to ARP/NDP while the DUT undergoes warm-reboot. Have you considered the same for LACP?

@vaibhavhd (Contributor, Author) commented Apr 26, 2023:

> @vaibhavhd This is an interesting finding. The original motivation for stopping teamd prior to syncd was to have the ability to send a last LACPDU before syncd removes the netdevs from the kernel. Since, in your observation, the kernel netdevs live until kexec, I believe this is no longer true. At least for Nvidia it is not true since sonic-net/sonic-buildimage#2572. But I don't know whether other vendors' syncd stop removes the netdevs or not. Can we swap the teamd/syncd shutdown order to achieve similar results?

@stepanblyschak, I tried a 201911 image on a 2700 and I still see that the last LACPDU is sent when kexec is issued. Can you re-evaluate on your end? I think even if the teamd/syncd sequence is swapped, this hack remains helpful, as it can send LACPDUs even after teamd is down.

> BTW, a bit more general question: I know you're using the "control plane assistant" to reply to ARP/NDP while the DUT undergoes warm-reboot. Have you considered the same for LACP?

Great question! I have been pondering that idea as well for the last few weeks. It might be possible, but not with the same VXLAN tunnel that we use with CPA: the VXLAN tunnel is for VLAN ports, and here we are talking about LAG ports, which aren't part of a VLAN. There are other, more complicated ways of getting that done that I haven't tried yet. This keepalive hack let us use the whole shutdown-path window.
But if a CPA-like mechanism works, it will prevent LACP session stalling in the boot-up path as well.

@vaibhavhd vaibhavhd closed this Apr 26, 2023
@vaibhavhd vaibhavhd reopened this Apr 26, 2023
@stepanblyschak (Contributor) left a comment:

I've tested these changes on 202211 and observe a 9-10 sec improvement on Nvidia. Thanks.

from scapy.config import conf
conf.ipv6_enabled = False
from scapy.all import sendp, sniff
from swsssdk import ConfigDBConnector
@stepanblyschak (Contributor):

swsssdk is now getting deprecated. Use swsscommon.

FYI, on 202211:

admin@arc-switch1004:~$ python3 -c "from swsssdk import ConfigDBConnector"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'swsssdk'
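
A minimal sketch of the replacement, assuming the swsscommon Python bindings expose a ConfigDBConnector with the same connect()/get_table() semantics as the old swsssdk one:

```python
# Sketch only: fetch LAG member ports via swsscommon instead of the deprecated swsssdk.
from swsscommon.swsscommon import ConfigDBConnector

config_db = ConfigDBConnector()
config_db.connect()

# PORTCHANNEL_MEMBER keys are (portchannel, member) tuples, e.g. ("PortChannel0001", "Ethernet0")
lag_members = [member for (_, member) in config_db.get_table("PORTCHANNEL_MEMBER")]
print(lag_members)
```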

@vaibhavhd (Contributor, Author):

Addressed this, thanks for catching it.

debug "Starting lag_keepalive to send LACPDUs ..."
timeout 300 python ${LAG_KEEPALIVE_SCRIPT} &
# give the lag_keepalive script a chance to get ready (30s) and collect one lacpdu before going down (30s)
sleep 60
@stepanblyschak (Contributor):

Can we run the lag_keepalive.py script in the foreground, and inside lag_keepalive.py do a fork() and run lag_keepalive() with a timeout? That way we only wait until the LACPDUs are collected, probably much less than 60 sec.
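
For illustration, a rough sketch of that suggestion (not the merged behavior; capture_lacpdus() and lag_keepalive() are placeholders standing in for the script's existing logic):

```python
# Sketch of the suggested flow: the parent blocks only until LACPDUs are
# captured, then a forked child keeps sending them until kexec kills it.
import os
import sys
import time

def capture_lacpdus():
    # placeholder for the capture step (see the earlier scapy sniff sketch)
    return {}

def lag_keepalive(captured):
    # placeholder for the ~1s replay loop (see the earlier scapy sendp sketch)
    while True:
        time.sleep(1)

def main():
    captured = capture_lacpdus()      # foreground: wait only for the capture phase
    if os.fork() == 0:                # child keeps running in the background
        lag_keepalive(captured)
        os._exit(0)
    sys.exit(0)                       # parent returns, so warm-reboot needs no fixed `sleep 60`

if __name__ == "__main__":
    main()
```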


@vaibhavhd this is a critical issue for our 202211 release, can you take care of the review and merge?

@vaibhavhd (Contributor, Author):

@stepanblyschak I preferred not to run this in the foreground, to avoid any odd chance of the warm-reboot flow getting hung by the called process.

Additionally, the wait here has to be a minimum of 30s to collect an LACPDU in the worst case (LACP's slow rate sends one LACPDU every 30s). The additional 30s is just a buffer. In my observation some platforms are very slow, and just running `from scapy.all import sendp, sniff` takes around 10-15s.
This wait can be optimized (if possible) in future PRs. What do you think?

@stepanblyschak (Contributor):

@vaibhavhd Ok, agree

@vaibhavhd vaibhavhd requested a review from saiarcot895 May 4, 2023 16:22
@vaibhavhd vaibhavhd merged commit 634ac77 into sonic-net:master May 4, 2023
@vaibhavhd vaibhavhd deleted the lag-keepalive branch May 4, 2023 18:35
vaibhavhd added a commit that referenced this pull request May 9, 2023
…-reboot (#2828)

Cherry pick of #2806 for 202012 branch

@moshemos commented May 10, 2023:

@StormLiangMS can you cherry pick to 202211?

StormLiangMS pushed a commit that referenced this pull request May 11, 2023
…2806)

yxieca pushed a commit that referenced this pull request May 17, 2023
…2806)

pdhruv-marvell pushed a commit to pdhruv-marvell/sonic-utilities that referenced this pull request Aug 23, 2023
…onic-net#2806)
