connectd+: Flake/race fix for new channels #5601

ddustin · 2022-09-16T22:14:57Z

dualopen has fd to connectd
channeld needs to take over
dualopen passes fd that leads to a connectd over for channeld to use
lightningd must receive the fd transfer request and process it
dualopen shuts down and closes everything it owns

4 & 5 end up in a race. If 5 happens before 4, channeld ends up with an invalid fd for connectd — leaving it in a position to not receive messages.

Lingering for a second makes 4 win the race. Since the daemon is closing anyway, waiting for a second should be alright.

rustyrussell · 2022-09-17T05:54:59Z

Oh, good catch! I'm not entirely surprised that exiting closes pending fds we've sent that haven't been read yet.

(You didn't describe what the user-visible result of this is, though, which would be useful! And a Changelog-Fixed line).

This might be fine for 0.12.1 (depending on what this actually fixes!) but I think we want to add something more robust in master. I think lightningd could simply close the fd to the subd after receiving the passed fds: it could wait for that and terminate.

channeld/channeld.c

ddustin · 2022-09-17T17:00:45Z

You didn't describe what the user-visible result of this is, though, which would be useful!

On my machine (Intel Mac 12.3.1 [21E258]) this script would (sometimes) trigger the problem

source contrib/start_regtest.sh
start_ln
fund_nodes
INVOICE=$(l2-cli invoice 10000 test test | jq -r ".bolt11")
l1-cli pay $INVOICE

Resulting in a user output that showed this error:

Waiting for lightning node funds... found.
Funding channel from node 1 to node 2. Waiting for confirmation... done.
{
   "code": 210,
   "message": "Ran out of routes to try after 1 attempt: see `paystatus`",
   "attempts": [
      {
         "status": "failed",
         "failreason": "Failing a partial payment due to a failed RPC call: {\"code\":204,\"message\":\"failed: WIRE_TEMPORARY_CHANNEL_FAILURE (First peer not ready)\",\"data\":{\"erring_index\":0,\"failcode\":4103,\"failcodename\":\"WIRE_TEMPORARY_CHANNEL_FAILURE\",\"erring_node\":\"03f8eaa056c5c9d390f183457cae5abde8af6cce2adb873b9dc83b44b4a2a81872\",\"erring_channel\":\"103x1x1\",\"erring_direction\":1}}",
         "partid": 0,
         "amount_msat": 10000
      }
   ]
}

Periodically on new channel creation (immediately after hand off to channeld) one side would never receive WIRE_ANNOUNCEMENT_SIGNATURES even though the other side did send it. Resulting in a log like this on the failing side:

2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: peer_out WIRE_ANNOUNCEMENT_SIGNATURES
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: billboard: Funding transaction locked. Waiting for their announcement signatures.
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: billboard: Funding transaction locked. Waiting for their announcement signatures.
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: billboard: Funding transaction locked. Waiting for their announcement signatures.
2022-09-14T22:20:16.504Z INFO    03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: Peer connection lost
2022-09-14T22:20:16.504Z DEBUG   03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-channeld-chan#1: Status closed, but not exited. Killing
2022-09-14T22:20:16.505Z INFO    03b95fdebce03a59f9c1cbdaa7a40b9360a70c214812bd813556cd6a65097a956b-chan#1: Peer transient failure in CHANNELD_NORMAL: channeld: Owning subdaemon channeld died (62208)

The other side would both send & receive WIRE_ANNOUNCEMENT_SIGNATURES. This implies (for the first node) the socket was in a strange state where messages could be sent across it yet not received. The call to read() would return 0 implying a clean shutdown of the connection. At first I thought this was between the two nodes but eventually realized it was the inter-daemon socket from channeld <-> connectd.

And a Changelog-Fixed line

Whoops, added it!

This might be fine for 0.12.1 (depending on what this actually fixes!) but I think we want to add something more robust in master. I think lightningd could simply close the fd to the subd after receiving the passed fds: it could wait for that and terminate.

Ah I was exploring ways to ack the passed fd and this sounds like a nice simple one. Had the same idea we could push this for 0.12.1 and build out a more robust solution after.

The problem is relatively reliably created on mac but my hope is it fix obscure flakiness issues for others. It has that sort of feel to it that it might.

1) dualopen has fd to connectd 2) channeld needs to take over 3) dualopen passes fd that leads to a connectd over for channeld to use 4) lightningd must receive the fd transfer request and process 5) dualopen shuts down and closes everything it owns 4 & 5 end up in a race. If 5 happens before 4, channeld ends up with an invalid fd for connectd — leaving it in a position to not receive messages. Lingering for a second makes 4 win the race. Since the daemon is closing anyway, waiting for a second should be alright. Changelog-Fixed: Fixed a condition for newly created channels that could trigger a need for reconnect.

rustyrussell · 2022-09-18T05:13:24Z

Pushed trivial whitespace fix, tagged for 0.12.1

cdecker reviewed Sep 17, 2022

View reviewed changes

channeld/channeld.c Outdated Show resolved Hide resolved

ddustin force-pushed the ddustin/flake_fix branch from 8f78470 to e479100 Compare September 17, 2022 16:59

rustyrussell force-pushed the ddustin/flake_fix branch from e479100 to 78675d2 Compare September 18, 2022 05:12

rustyrussell added this to the v0.12.1 milestone Sep 18, 2022

rustyrussell added the voodoo Suspect user, issue or code has ancient curse label Sep 18, 2022

rustyrussell merged commit cc206e1 into ElementsProject:master Sep 18, 2022

niftynei mentioned this pull request Sep 22, 2022

multifundchannel - some channels are stuck at CHANNELD_AWAITING_LOCKIN until reconnect #5571

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

connectd+: Flake/race fix for new channels #5601

connectd+: Flake/race fix for new channels #5601

ddustin commented Sep 16, 2022

rustyrussell commented Sep 17, 2022

ddustin commented Sep 17, 2022

rustyrussell commented Sep 18, 2022

connectd+: Flake/race fix for new channels #5601

connectd+: Flake/race fix for new channels #5601

Conversation

ddustin commented Sep 16, 2022

rustyrussell commented Sep 17, 2022

ddustin commented Sep 17, 2022

rustyrussell commented Sep 18, 2022