LAG keepalive script to reduce lacp session wait during warm-reboot #2806

Merged
merged 3 commits into from
May 4, 2023
8 changes: 8 additions & 0 deletions scripts/fast-reboot
@@ -17,6 +17,7 @@ STRICT=no
REBOOT_METHOD="/sbin/kexec -e"
ASSISTANT_IP_LIST=""
ASSISTANT_SCRIPT="/usr/local/bin/neighbor_advertiser"
LAG_KEEPALIVE_SCRIPT="/usr/local/bin/lag_keepalive.py"
WATCHDOG_UTIL="/usr/local/bin/watchdogutil"
DEVPATH="/usr/share/sonic/device"
PLATFORM=$(sonic-cfggen -H -v DEVICE_METADATA.localhost.platform)
@@ -682,6 +683,13 @@ fi
# disable trap-handlers which were set before
trap '' EXIT HUP INT QUIT TERM KILL ABRT ALRM

# start sending LACPDUs to keep the LAGs refreshed
# this is a non-blocking call, and the process will die in 300s
debug "Starting lag_keepalive to send LACPDUs ..."
timeout 300 python ${LAG_KEEPALIVE_SCRIPT} &
# give the lag_keepalive script a chance to get ready (30s) and collect one lacpdu before going down (30s)
sleep 60
Contributor

Can we run lag_keepalive.py in the foreground, and have the script itself fork() and run lag_keepalive() in the child with a timeout? That way we wait only until the LACPDUs are collected, which is probably much less than 60 seconds.

@vaibhavhd this is a critical issue for our 202211 release. Can you take care of the review and merge?

Contributor Author

@stepanblyschak I preferred not to run this in the foreground, to avoid any chance of the reboot script hanging on the called process.

Additionally, the wait here has to be at least 30s to collect an LACPDU in the worst case. The additional 30s is just a buffer. In my observation, some platforms are very slow, and just the import statement (from scapy.all import sendp, sniff) takes around 10-15s.
This wait can be optimized (if possible) in future PRs. What do you think?
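
The import cost mentioned above can be checked per platform with a small timing helper. This is a minimal sketch and not part of the PR: `time_import` is a hypothetical name, and on a switch the module to measure would be `scapy.all` rather than the stand-in used here.

```python
import importlib
import time


def time_import(module_name):
    """Return the wall-clock seconds spent importing module_name.

    Hypothetical helper for illustration only.
    A module that is already loaded returns almost instantly, so run this
    in a fresh interpreter to get a meaningful number.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # "json" keeps the sketch runnable anywhere; use "scapy.all" on the device.
    print("import took {:.3f}s".format(time_import("json")))
```

A measurement like this would show how much of the 30s buffer a given platform actually needs.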

Contributor

@vaibhavhd Ok, agree
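
For reference, the foreground-plus-fork idea raised in this thread could look roughly like the sketch below. It is not code from this PR: `collect_then_detach`, `collect`, `keepalive`, and `lifetime_s` are hypothetical names, and the collection and send steps are injected as plain callables so the sketch runs without scapy.

```python
import os
import signal
import time


def collect_then_detach(collect, keepalive, lifetime_s):
    """Run collect() in the foreground, then fork a child that runs keepalive().

    Returns (child_pid, collected) to the caller as soon as collection is done.
    The child is terminated by SIGALRM after lifetime_s whole seconds.
    """
    collected = collect()          # blocks only as long as collection takes
    pid = os.fork()
    if pid == 0:
        # Child: restore the default SIGALRM action so the alarm ends the process.
        signal.signal(signal.SIGALRM, signal.SIG_DFL)
        signal.alarm(lifetime_s)
        try:
            keepalive(collected)
        finally:
            os._exit(0)            # never fall back into the parent's code path
    return pid, collected


if __name__ == "__main__":
    def fake_collect():
        return {"Ethernet0": b"lacpdu"}

    def fake_keepalive(packets):
        while True:                # stands in for the periodic send loop
            time.sleep(0.1)

    child, packets = collect_then_detach(fake_collect, fake_keepalive, lifetime_s=1)
    _, status = os.waitpid(child, 0)
    print("child ended by signal:", os.WIFSIGNALED(status))
```

The caller regains control as soon as `collect()` returns, which is the saving the reviewer was after. The author's concern about a hang still applies to `collect()` itself, so it would need its own bound; the 30s sniff timeout in this PR plays that role.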


if [ -x ${LOG_SSD_HEALTH} ]; then
    debug "Collecting logs to check ssd health before ${REBOOT_TYPE}..."
    ${LOG_SSD_HEALTH}
102 changes: 102 additions & 0 deletions scripts/lag_keepalive.py
@@ -0,0 +1,102 @@
#!/usr/bin/env python3

from scapy.config import conf
conf.ipv6_enabled = False
from scapy.all import sendp, sniff
from swsssdk import ConfigDBConnector
Contributor

swsssdk is being deprecated. Use swsscommon instead.

FYI, on 202211:

admin@arc-switch1004:~$ python3 -c "from swsssdk import ConfigDBConnector"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'swsssdk'

Contributor Author

Addressed this, thanks for catching it.
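
Since the traceback above shows swsssdk missing on 202211, an import that tolerates both packages might look like the following. This is a minimal sketch, assuming that `swsscommon.swsscommon` exposes a `ConfigDBConnector` class; `first_importable` is a hypothetical helper and not part of either package.

```python
import importlib


def first_importable(candidates):
    """Return the first module in candidates that imports cleanly, else None."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:        # also covers ModuleNotFoundError
            continue
    return None


if __name__ == "__main__":
    # Prefer swsscommon; fall back to the deprecated swsssdk on older images.
    module = first_importable(["swsscommon.swsscommon", "swsssdk"])
    if module is None:
        print("neither swsscommon nor swsssdk is installed")
    else:
        # Assumption: both modules provide a class with this name.
        ConfigDBConnector = getattr(module, "ConfigDBConnector")
        print("using ConfigDBConnector from", module.__name__)
```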

import time, threading, traceback
import syslog

SYSLOG_ID = 'lag_keepalive'


def log_info(msg):
    syslog.openlog(SYSLOG_ID)
    syslog.syslog(syslog.LOG_INFO, msg)
    syslog.closelog()


def log_error(msg):
    syslog.openlog(SYSLOG_ID)
    syslog.syslog(syslog.LOG_ERR, msg)
    syslog.closelog()


def sniff_lacpdu(device_mac, lag_member, lag_member_to_packet):
    sniffed_packet = sniff(iface=lag_member,
                           filter="ether proto 0x8809 and ether src {}".format(device_mac),
                           count=1, timeout=30)
    lag_member_to_packet[lag_member] = sniffed_packet
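
One point worth noting about the function above: scapy's `sniff` returns an empty packet list when the timeout expires without a match, so the assignment records an entry for the member either way, and a later length comparison against the active members would not reveal a missed capture. A hedged variation that stores only non-empty captures is sketched below. `record_lacpdu` and `capture` are hypothetical names; the sniff call is injected as a callable so the sketch runs without scapy or a live interface.

```python
def record_lacpdu(lag_member, lag_member_to_packet, capture):
    """Store a capture for lag_member only if it holds at least one packet.

    capture is a zero-argument callable standing in for the sniff(...) call.
    Returns True when a packet was recorded, False on an empty capture.
    """
    packets = capture()
    if len(packets) == 0:
        return False               # timed out: leave the member out of the dict
    lag_member_to_packet[lag_member] = packets
    return True


if __name__ == "__main__":
    captured = {}
    record_lacpdu("Ethernet0", captured, lambda: [b"lacpdu"])
    record_lacpdu("Ethernet4", captured, lambda: [])
    print(sorted(captured))        # only Ethernet0 is present
```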


def get_lacpdu_per_lag_member():
    appDB = ConfigDBConnector()
    appDB.db_connect('APPL_DB')
    appDB_lag_info = appDB.get_keys('LAG_MEMBER_TABLE')
    configDB = ConfigDBConnector()
    configDB.db_connect('CONFIG_DB')
    device_mac = configDB.get(configDB.CONFIG_DB, "DEVICE_METADATA|localhost", "mac")
    hwsku = configDB.get(configDB.CONFIG_DB, "DEVICE_METADATA|localhost", "hwsku")
    active_lag_members = list()
    lag_member_to_packet = dict()
    sniffer_threads = list()
    for lag_entry in appDB_lag_info:
        lag_name = str(lag_entry[0])
        oper_status = appDB.get(appDB.APPL_DB,"LAG_TABLE:{}".format(lag_name), "oper_status")
        if oper_status == "up":
            # only apply the workaround for active lags
            lag_member = str(lag_entry[1])
            active_lag_members.append(lag_member)
            # use threading to capture lacpdus from several lag members simultaneously
            sniffer_thread = threading.Thread(target=sniff_lacpdu,
                                              args=(device_mac, lag_member, lag_member_to_packet))
            sniffer_thread.start()
            sniffer_threads.append(sniffer_thread)

    # sniff for lacpdu should finish in <= 30s. sniff timeout is also set to 30s
    for sniffer in sniffer_threads:
        sniffer.join(timeout=30)

    return active_lag_members, lag_member_to_packet


def lag_keepalive(lag_member_to_packet):
    while True:
        for lag_member, packet in lag_member_to_packet.items():
            try:
                sendp(packet, iface=lag_member, verbose=False)
            except Exception:
                # log failure and continue to send lacpdu
                traceback_msg = traceback.format_exc()
                log_error("Failed to send LACPDU packet from interface {} with error: {}".format(
                    lag_member, traceback_msg))
                continue
        log_info("sent LACPDU packets via {}".format(lag_member_to_packet.keys()))
        time.sleep(1)
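
The loop above never exits on its own and relies on the `timeout 300` wrapper in fast-reboot to end it. If the script were to bound itself instead, a deadline-based variant might look like this sketch. It is illustrative only: `keepalive_until`, `send`, `lifetime_s`, and `interval_s` are hypothetical names, and `send` stands in for scapy's `sendp` so the code runs without it.

```python
import time


def keepalive_until(lag_member_to_packet, send, lifetime_s, interval_s=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Resend each stored packet every interval_s until lifetime_s has elapsed.

    send(packet, lag_member) is called once per member per round; a failure on
    one member is recorded and does not stop the others.
    Returns (rounds, failures).
    """
    deadline = clock() + lifetime_s
    rounds, failures = 0, []
    while clock() < deadline:
        for lag_member, packet in lag_member_to_packet.items():
            try:
                send(packet, lag_member)
            except Exception as exc:
                failures.append((lag_member, repr(exc)))
        rounds += 1
        sleep(interval_s)
    return rounds, failures


if __name__ == "__main__":
    sent = []
    rounds, failures = keepalive_until(
        {"Ethernet0": b"lacpdu"},
        send=lambda packet, member: sent.append(member),
        lifetime_s=0.3, interval_s=0.1)
    print("rounds:", rounds, "failures:", len(failures))
```

Injecting `clock` and `sleep` keeps the loop testable without real waiting; the external `timeout` wrapper can still be kept as a second line of defence.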


def main():
    while True:
        try:
            active_lag_members, lag_member_to_packet = get_lacpdu_per_lag_member()
            if len(active_lag_members) != len(lag_member_to_packet.keys()):
                log_error("Failed to capture LACPDU packets for some lag members. " +\
                          "Active lag members: {}. LACPDUs captured for: {}".format(
                              active_lag_members, lag_member_to_packet.keys()))

            log_info("ready to send LACPDU packets via {}".format(lag_member_to_packet.keys()))
        except Exception:
            traceback_msg = traceback.format_exc()
            log_error("Failed to get LAG members and LACPDUs with error: {}".format(
                traceback_msg))
            # keep attempting until sniffed packets are ready
            continue
        # if no exceptions are thrown, break from loop as LACPDUs are ready to be sent
        break

    if lag_member_to_packet:
        # start an infinite loop to keep sending lacpdus from lag member ports
        lag_keepalive(lag_member_to_packet)

if __name__ == "__main__":
    main()
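
The retry loop in `main()` repeats immediately and without limit whenever collection raises. A bounded retry with a pause between attempts is one possible alternative, sketched here under stated assumptions: `retry`, `attempts`, and `delay_s` are hypothetical names, and the operation being retried is passed in as a callable rather than the real `get_lacpdu_per_lag_member()`.

```python
import time


def retry(operation, attempts, delay_s, sleep=time.sleep):
    """Call operation() until it succeeds or attempts run out.

    Returns the operation's result; re-raises the last exception once the
    final attempt has failed.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay_s)         # pause before the next attempt


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("not ready")
        return "ready"

    print(retry(flaky, attempts=5, delay_s=0.01))
```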
1 change: 1 addition & 0 deletions setup.py
@@ -139,6 +139,7 @@
        'scripts/intfutil',
        'scripts/intfstat',
        'scripts/ipintutil',
        'scripts/lag_keepalive.py',
        'scripts/lldpshow',
        'scripts/log_ssd_health',
        'scripts/mellanox_buffer_migrator.py',