LAG keepalive script to reduce lacp session wait during warm-reboot #2806
Conversation
@stepanblyschak please review. You might find this interesting, since you are also focused on reducing the LACP session wait period during warm reboot. This is a little hacky, but it takes advantage of the fact that packets can still be sent even after teamd and syncd are down, as the physical ports are still up and sockets can still function. It would help if you could also test this in your setup and share the time saved in your tests.
@vaibhavhd This is an interesting finding. The original motivation for stopping teamd prior to syncd was to have the ability to send a last LACPDU before syncd removes the netdevs from the kernel. Since, in your observation, kernel netdevs live until kexec, I believe this is no longer true. At least for Nvidia it has not been true since sonic-net/sonic-buildimage#2572. But I do not know whether other vendors' syncd stop removes netdevs or not. Can we swap the teamd/syncd shutdown order to achieve similar results? BTW, a more general question: I know you're using the "control plane assistant" to reply to ARP/NDP while the DUT undergoes warm-reboot. Have you considered the same for LACP?
@stepanblyschak, I tried the 201911 image on a 2700 and I still see that the last LACPDU is sent when kexec is issued. Can you re-evaluate on your end? I think that even if the teamd/syncd sequence is swapped, this hack remains helpful, as it can send LACPDUs even after teamd is down.
Great question! I have been pondering that idea as well for the last few weeks. It might be possible, but not with the same VXLAN tunnel that we use with CPA: the VXLAN tunnel is for VLAN ports, while here we are talking about LAG ports, which aren't part of a VLAN. There are other, more complicated ways of getting that done that I haven't tried yet. This keepalive hack lets us reclaim the whole shutdown-path time.
I've tested these changes on 202211 and observe a 9-10 sec improvement on Nvidia. Thanks.
scripts/lag_keepalive.py (outdated)
from scapy.config import conf
conf.ipv6_enabled = False
from scapy.all import sendp, sniff
from swsssdk import ConfigDBConnector
swsssdk is now deprecated. Use swsscommon instead.
FYI, on 202211:
admin@arc-switch1004:~$ python3 -c "from swsssdk import ConfigDBConnector"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'swsssdk'
Addressed this, thanks for catching it.
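For reference, the replacement could look roughly like the sketch below. The guarded import and the helper name lag_members_from_keys are illustrative assumptions, not the actual PR code; swsscommon is only present inside a SONiC image, and CONFIG_DB PORTCHANNEL_MEMBER keys use the "lag|member" form.

```python
# Sketch of the swsssdk -> swsscommon migration. The try/except is an
# assumption so the file can be imported outside a SONiC image for testing.
try:
    from swsscommon.swsscommon import ConfigDBConnector  # supported API
except ImportError:
    ConfigDBConnector = None  # not available outside a SONiC image

def lag_members_from_keys(keys):
    """Group CONFIG_DB PORTCHANNEL_MEMBER keys such as
    'PortChannel0001|Ethernet0' into {lag: [member, ...]}."""
    lags = {}
    for key in keys:
        lag, member = key.split("|", 1)
        lags.setdefault(lag, []).append(member)
    return lags
```

On a device, the keys would come from a connected ConfigDBConnector (get_table('PORTCHANNEL_MEMBER')); the pure-Python helper keeps the parsing testable without Redis.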
debug "Starting lag_keepalive to send LACPDUs ..."
timeout 300 python ${LAG_KEEPALIVE_SCRIPT} &
# give the lag_keepalive script a chance to get ready (30s) and collect one LACPDU before going down (30s)
sleep 60
Can we run the lag_keepalive.py script in the foreground, and inside lag_keepalive.py do a fork() and run lag_keepalive() with a timeout? That way we would only wait until the LACPDUs are collected, probably much less than 60 sec.
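The fork-with-timeout idea above could be sketched like this (the function name and timeouts are illustrative, not the actual PR code):

```python
import os
import time

def run_with_timeout_in_child(fn, timeout_sec):
    """Fork a child that repeatedly calls fn() until the deadline, then exits.

    The parent returns immediately with the child's pid, so the caller
    (e.g. the warm-reboot script) is not blocked by the keepalive loop.
    """
    pid = os.fork()
    if pid == 0:
        # child: run the keepalive work until the deadline
        deadline = time.monotonic() + timeout_sec
        while time.monotonic() < deadline:
            fn()
        os._exit(0)  # never fall back into the parent's code path
    return pid
```

This is Linux-only (os.fork); the shell-level `timeout 300 ... &` in the PR achieves a similar effect from outside the script.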
@vaibhavhd this is a critical issue for our 202211 release, can you take care of the review and merge?
@stepanblyschak I preferred not to run this in the foreground, to avoid any odd chance of the caller getting hung by the called process.
Additionally, the wait here has to be a minimum of 30s to collect an LACPDU in the worst case (the LACP slow-rate transmit interval is 30s). The additional 30s is just a buffer. In my observation some platforms are very slow, and just importing sendp and sniff from scapy.all takes around 10-15s.
This wait can be optimized (if possible) in future PRs. What do you think?
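One possible future optimization of that fixed 60s wait, sketched under the assumption that the script forks (all names here are hypothetical): the child signals readiness over a pipe as soon as it has collected its LACPDUs, so the parent blocks only for the actual collection time.

```python
import os
import time

def start_keepalive(collect_fn, keepalive_fn, max_runtime=300):
    """Fork a keepalive child; the parent returns as soon as the child
    has collected its LACPDUs, instead of sleeping a fixed 60s."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        frames = collect_fn()   # e.g. sniff one LACPDU per LAG member port
        os.write(w, b"R")       # signal readiness to the parent
        os.close(w)
        deadline = time.monotonic() + max_runtime
        while time.monotonic() < deadline:
            keepalive_fn(frames)
        os._exit(0)
    os.close(w)
    os.read(r, 1)               # parent blocks only until collection is done
    os.close(r)
    return pid
```

On a fast platform the parent would unblock as soon as one LACPDU per port is seen, well under the current 60s.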
@vaibhavhd Ok, agree
…-reboot (#2828) Cherry-pick of #2806 for the 202012 branch (see the PR description below for details).
@StormLiangMS can you cherry-pick to 202211?
What I did
A new mechanism is added here to reduce the LAG flap issue during hitless upgrades.
Problem being solved:
During warm upgrades, the T0 goes down and with that the LACP session wait time starts.
If the wait time to refresh the LACP session is > 90s, the T1 initiates LAG teardown, and as a result dataplane impact is seen.
This script makes sure that LACPDUs continue to be sent on the going-down path.
How time is saved with this mechanism:
The LACP session wait period previously started when the teamd container went down.
The new LACP session wait period starts when kexec in the current kernel is issued and the new kernel boots up.
Implementation:
When warm-reboot starts, capture LACPDUs sent from all LAG member ports.
For this, allow 60s of prep + collection time.
Start sending LACPDUs at a ~1s interval.
The last LACPDU is sent after all containers are down and kexec is issued.
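The implementation steps above boil down to a capture-then-resend loop. A minimal sketch (send_fn is a stand-in for scapy's sendp(frame, iface=port); the port names and frame bytes are illustrative):

```python
import time

def keepalive_loop(captured, send_fn, duration_sec, interval_sec=1.0):
    """Re-send the last captured LACPDU on every LAG member port each interval.

    captured: {port_name: raw_frame_bytes}, one frame per member port,
              recorded during the 60s collection window.
    send_fn:  injected sender; on a device this would wrap scapy's sendp().
    """
    deadline = time.monotonic() + duration_sec
    sent = 0
    while time.monotonic() < deadline:
        for port, frame in captured.items():
            send_fn(port, frame)
            sent += 1
        time.sleep(interval_sec)
    return sent
```

Injecting send_fn keeps the loop testable without root privileges or real interfaces; because the raw sockets outlive teamd and syncd, this loop can keep running until kexec.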
Results:
Tested this on different platforms and images. Some results for time saved:
BRCM: 201811 -> 202012 --- 18s
BRCM: 202012 -> 202012 --- 20s
MLNX: 201911 -> 202205 --- 10s