[DHCP_RELAY] [IPv6] [202012] Failed to bind socket to link local ipv6 address #11431
Comments
@vivekrnv Are both global and link-local addresses available for the interface?
Hi @kellyyeh, dhcp6relay failed to bind even after 6 retries here: https://github.com/sonic-net/sonic-dhcp-relay/blob/master/src/relay.cpp#L438 . This seems weird, since the retry mechanism waits almost 5 secs between tries, i.e. 30 secs overall. I would've expected the link-local address to be part of the kernel data structures by then.
Had the bind to the global address failed, I would've seen this log, I suppose: https://github.com/sonic-net/sonic-dhcp-relay/blob/master/src/relay.cpp#L434 . Since I haven't, I'm assuming the bind to the global address did not fail.
@vivekrnv what's the cause for the missing ipv6 link local address on interface? Six retries should be enough for the interface to come up and addresses to be available.
I agree that six retries should be good enough, but I have no clue why the address isn't available. It's hard to repro and is only seen during boot. I have other priorities right now, so I can't get that info immediately. But I did notice that this script, https://github.com/Azure/sonic-buildimage/blob/master/dockers/docker-dhcp-relay/wait_for_intf.sh.j2 , only checks for IPv4 prefixes. Shouldn't it also check for IPv6 prefixes?
@vivekrnv That is a great suggestion. However, we need to be careful here. I think the wait_for_intf.sh script needs to know whether we have IPv6 and/or IPv4 configuration, so that it can wait for the right things to show up.
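A rough sketch of the point above, in Python for illustration only (the real script is a shell template): classify the prefixes configured on an interface by address family, so the wait logic knows whether to wait for IPv4, IPv6, or both. The function name `families_to_wait_for` and the sample prefixes are hypothetical; how the actual script obtains the configuration is not shown in this thread.

```python
import ipaddress


def families_to_wait_for(configured_prefixes):
    """Return the set of IP versions (4 and/or 6) found in the configured prefixes."""
    versions = set()
    for prefix in configured_prefixes:
        # strict=False accepts interface-style prefixes with host bits set,
        # e.g. "192.168.0.1/21".
        network = ipaddress.ip_network(prefix, strict=False)
        versions.add(network.version)
    return versions


if __name__ == "__main__":
    # Hypothetical configuration for one Vlan interface.
    print(families_to_wait_for(["192.168.0.1/21", "fc02:1000::1/64"]))
```

With a result like this, the wait step would only block on IPv6 readiness when an IPv6 prefix is actually configured.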
@yxieca - Can you please share the PR that fixes this issue?
Yup, I agree. We check the STATE_DB to determine if the state is up. I'm not sure that can be used for IPv6 link-local addresses though. If I'm not wrong, an IPv6 link-local address is automatically assigned to the netdev object by the kernel (or it is probably controlled by the SDK, I'm not certain here) after the creation of the Vlan here: https://github.com/Azure/sonic-swss/blob/master/cfgmgr/vlanmgr.cpp#L122
The link-local address is added to the vlan iface after a member is added to it. I guess the kernel adds the link-local address only after a member is added. Also, the info is not populated in STATE_DB, so how would you suggest updating this script, https://github.com/sonic-net/sonic-buildimage/blob/202205/dockers/docker-dhcp-relay/wait_for_intf.sh.j2 , to check for this interface before starting dhcp6relay?
@vivekrnv are you going to add a fix to address this issue?
The fix I thought of is to add a check to the wait_for_intf.sh script that waits until the link-local address is assigned to the Vlan iface. As I mentioned in the comment above, it's not populated in the STATE_DB, so we'd have to find another way. But I guess @kellyyeh should take it up if she can.
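As a minimal sketch of such a check, assuming the kernel's `/proc/net/if_inet6` table is used as the source of truth instead of STATE_DB: the code below looks for a link-local address on the interface that is no longer tentative (duplicate address detection finished), and polls until it appears or a timeout expires. This is written in Python for illustration and is not the fix that was merged; the helper names, timeout, and polling interval are all hypothetical.

```python
import time

IFA_F_TENTATIVE = 0x40  # address is still undergoing duplicate address detection
SCOPE_LINK = 0x20       # link-local scope value used in /proc/net/if_inet6


def has_usable_link_local(if_inet6_text, ifname):
    """True if ifname has a non-tentative link-local address in the table text."""
    for line in if_inet6_text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue  # skip malformed or empty lines
        # Columns: address, ifindex, prefix length, scope, flags, device name
        _addr, _ifindex, _prefixlen, scope, flags, name = fields
        if name != ifname:
            continue
        if int(scope, 16) == SCOPE_LINK and not int(flags, 16) & IFA_F_TENTATIVE:
            return True
    return False


def read_if_inet6():
    with open("/proc/net/if_inet6") as f:
        return f.read()


def wait_for_link_local(ifname, timeout=60.0, interval=1.0,
                        read_table=read_if_inet6, sleep=time.sleep):
    """Poll until ifname has a usable link-local address; False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if has_usable_link_local(read_table(), ifname):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)
```

A caller would run something like `wait_for_link_local("Vlan1000")` before launching dhcp6relay and decide what to do if it returns False.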
@kellyyeh Since the bind operation is critical to us, is it possible to set SO_REUSEADDR on the socket being bound? That way we can avoid cases where the LLA is already bound by some other unknown app, and still keep the dhcpv6 relay working.
@jcaiMR The dhcpv6 server needs to know the LLA for address assignment, so link-local is a must in our case.
…k local addr (#12273)
- Why I did it
Fixes #11431
- How I did it
dhcp6relay binds to ipv6 addresses configured on these vlan interfaces
Thus check if they are ready before launching dhcp6relay
- How to verify it
Unit Tests
Tested on a live device
Signed-off-by: Vivek Reddy Karri <vkarri@nvidia.com>
Description
When started, dhcp6relay tries to bind its socket to the IPv6 link-local address.
However, in rare scenarios, there can be a race between the IPv6 device in the kernel becoming operationally up and dhcp6relay binding the socket to the iface.
If the former doesn't happen before the latter, the socket bind fails.
There is a retry mechanism in dhcp6relay that tries 6 times before giving up, but it doesn't seem to be sufficient.
Steps to reproduce the issue:
I couldn't repro it manually, but it is seen during the pc/test_po_update.py test on the 202012 image:
py.test pc/test_po_update.py --inventory="../ansible/inventory,../ansible/veos" --host-pattern r-tigris-04 --module-path ../ansible/library/ --testbed r-tigris-04-t0-64 --testbed_file ../ansible/testbed.csv --allow_recover --assert plain --log-cli-level debug --show-capture=no -ra --showlocals --topology t0,any,util --skip_sanity
Describe the results you received:
Describe the results you expected:
Error log should not be seen and dhcp6relay should wait until the iface is ready
Output of show version:
Output of show techsupport:
Analysis:
Firstly, I don't completely understand the reason for the delay in the link-local address appearing on the Vlan1000 netdev. BTW, this address is assigned/controlled by the kernel, not managed by SONiC.
Given that this happens (although very rarely), dhcp6relay should be robust enough to handle it. Just doing six retries is not a good solution.
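One possible shape for a more robust retry, sketched in Python for illustration only (dhcp6relay itself is C++): retry against an overall time budget with a growing delay instead of a fixed count of six. The function name `bind_with_deadline`, the `try_bind` callback, and the default timings are assumptions, not the relay's actual behavior.

```python
import time


def bind_with_deadline(try_bind, deadline_s=120.0, initial_delay_s=1.0,
                       max_delay_s=10.0, sleep=time.sleep, clock=time.monotonic):
    """Call try_bind() until it succeeds or the time budget would be exceeded.

    try_bind is a hypothetical callable that raises OSError when the bind
    fails (for example, because the link-local address is not assigned yet).
    Returns the number of attempts made; re-raises the last OSError on timeout.
    """
    start = clock()
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            try_bind()
            return attempts
        except OSError:
            # Give up if sleeping again would overrun the overall budget.
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay_s)  # exponential backoff, capped
```

Measuring elapsed time with a monotonic clock also means that a slow call inside `try_bind` counts against the budget, rather than silently stretching the total wait.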
Another interesting observation: the gap between retries 5 and 6 should have been 5 sec, but it's almost 40 sec. That's likely because the getifaddrs call is extremely slow sometimes. There are a few reports elsewhere about the slowness of this call.
Additional information you deem important (e.g. issue happens only occasionally):
syslog.gz