-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connectd+: Flake/race fix for new channels #5601
connectd+: Flake/race fix for new channels #5601
Conversation
Oh, good catch! I'm not entirely surprised that exiting closes pending fds we've sent that haven't been read yet. (You didn't describe what the user-visible result of this is, though, which would be useful! And a Changelog-Fixed line). This might be fine for 0.12.1 (depending on what this actually fixes!) but I think we want to add something more robust in master. I think lightningd could simply close the fd to the subd after receiving the passed fds: it could wait for that and terminate. |
8f78470
to
e479100
Compare
On my machine (Intel Mac 12.3.1 [21E258]) this script would (sometimes) trigger the problem
Resulting in a user output that showed this error:
Periodically on new channel creation (immediately after hand off to channeld) one side would never receive WIRE_ANNOUNCEMENT_SIGNATURES even though the other side did send it. Resulting in a log like this on the failing side:
The other side would both send & receive WIRE_ANNOUNCEMENT_SIGNATURES. This implies (for the first node) the socket was in a strange state where messages could be sent across it yet not received. The call to read() would return 0 implying a clean shutdown of the connection. At first I thought this was between the two nodes but eventually realized it was the inter-daemon socket from channeld <-> connectd.
Whoops, added it!
Ah I was exploring ways to ack the passed fd and this sounds like a nice simple one. Had the same idea we could push this for 0.12.1 and build out a more robust solution after. The problem is relatively reliably created on mac but my hope is it fix obscure flakiness issues for others. It has that sort of feel to it that it might. |
1) dualopen has fd to connectd 2) channeld needs to take over 3) dualopen passes fd that leads to a connectd over for channeld to use 4) lightningd must receive the fd transfer request and process 5) dualopen shuts down and closes everything it owns 4 & 5 end up in a race. If 5 happens before 4, channeld ends up with an invalid fd for connectd — leaving it in a position to not receive messages. Lingering for a second makes 4 win the race. Since the daemon is closing anyway, waiting for a second should be alright. Changelog-Fixed: Fixed a condition for newly created channels that could trigger a need for reconnect.
e479100
to
78675d2
Compare
Pushed trivial whitespace fix, tagged for 0.12.1 |
4 & 5 end up in a race. If 5 happens before 4, channeld ends up with an invalid fd for connectd — leaving it in a position to not receive messages.
Lingering for a second makes 4 win the race. Since the daemon is closing anyway, waiting for a second should be alright.