Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retransmission with state counters #181

Merged

Conversation

rustyrussell
Copy link
Collaborator

OK, so if we keep rollback and add per-channel reconnect counters, this is what it looks like. I know @adiabat wanted simple retransmit, but I think this is actually simpler.

Caveat: I haven't actually implemented this yet, but I'm about to try to do so...

@rustyrussell rustyrussell force-pushed the retransmission-with-state-counters branch 3 times, most recently from 7657d11 to 96e6c1b Compare May 25, 2017 04:47
Copy link
Collaborator

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On an initial pass, I think this works. It's also much simpler than any of the prior approaches, and I generally think that when designing protocols it's better to be explicit when possible.

I may be able to take a stab at implementing this sometime near the tail end of the next week. When I get to that point, I'll report back with any issues or edge cases I ran into during the implementation process.

Thanks for speccing this out!

* `update_` messages: acknowledged by `revoke_and_ack`.
* `commitment_signed`: acknowledged by `revoke_and_ack`.
* `revoke_and_ack`: acknowledged by `commitment_signed` or `closing_signed`
* `shutdown`: acknowledged by `closing_signed`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we no longer care about retransmitting shutdown messages? I guess it wasn't really necessary as upon reconnection, if one still wishes to close down the channel, the should just send another shutdown message and re-start the fee negotiation workflow.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this should be explicit. We could add it in to the commit counter, but I prefer the simplicity of "always retransmit" and associated "ignore duplicates".

@@ -919,18 +919,24 @@ independent of particular channels; their transmission requirements
are covered there, and other than being transmitted after `init` (like
any message), they are independent of requirements here.

1. type: 136 (`channel_reestablish`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still reading this diff, but so far: I dig this! Such an explicit message helps to sidestep the "The Uncertain Signature Problem" all together.

`commitment_signed` messages the receiving node has sent, it MUST send
a set of `update_` messages against the prior commitment transaction
followed by `commitment_signed`, otherwise if `commitments_received`
is not equal to the number of `commitment_signed` messages the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As this indicates that the channel states have desynchronized, or the transmitting party had persistence corruption?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, something's badly wrong. It's also a big hint that you only have to handle the two cases, and only then until you've received the revoke_and_ack.

If we add multiple commits-in-flight, this "one less" becomes "N less".


If `commitments_received` is one less than the number of
`commitment_signed` messages the receiving node has sent, it MUST send
a set of `update_` messages against the prior commitment transaction
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean only the outbound send of update_* messages, or both inbound (updates I've made) and outbound (updates you've made)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Outbound. If you've missed inbound, the other node will re-xmit because the commitments_received you send is one too low.

`commitments_received` is not equal to the commitment number of the
last `commitment_signed` message the receiving node has sent, it
SHOULD fail the channel.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the receiving node does not remember the updates it has sent ? sending a commitment_signed would be illegal I think

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always resend updates then commitment signed. And counters in reestablish msg make it clear whether it received previous or not.

Copy link
Collaborator

@pm47 pm47 Jun 14, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that:
"it MUST send a set of update_ messages against the prior commitment transaction followed by commitment_signed"
is actually:
"it MUST re-send the same update_ messages that lead to the prior commitment, followed by the same commitment_signed?

This is particularly important if a node does not simply retransmit the exact same update_ messages as previously sent.

So no you don't mean that!

If we are permissive on what the replacing commitment might be, why don't we just say:

"If commitments_received is one less than the commitment number of the last commitment_signed message the receiving node has sent, it MUST reuse the same commitment number for its next commitment_signed, otherwise if commitments_received is not equal to the commitment number of the last commitment_signed message the receiving node has sent, it SHOULD fail the channel."

I don't get why we would require the receiving node to send a new commitment immediately, when the node might not have anything to send (maybe all htlcs are expired), and it can't be enforced anyway.

@Roasbeef pointed out that in an exact-replay implementation, in case of expired htlcs the counterparty might want to accept them and fail them right away, which is not clear if the spec allows, or if it is considered a protocol violation (BOLT 4 has specific errors for expiry_too_soon but how about an expiry in the past)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. I'll fix this using your wording.

Now, in practice, you have to remember the state when you send commitment_signed, so you probably just retransmit (that's what I do).

Re: expired HTLCs. The receiver will fail with expiry_too_soon, yes. But note that there's only a narrow window where sender would send them: they had to set their deadline when sending the first commitment_signed message, and that would have expired and forced them to go on-chain. Otherwise, they might have received the commitment_signed, and published the commitment transaction containing the expired HTLC then fulfilled.

@sstone
Copy link
Collaborator

sstone commented May 31, 2017

How are forwarded HTLC resent when a node restarts ?

Suppose we're Bob, with the following setup:
Alice <-- channel #ab --> Bob <-- channel #bc --> Carol

Alice and Bob have both signed HTLC #1 and #2.
Bob has forwarded HTLC #1 and #2 but has not signed anything yet.
Bob's sub-system that handles channel #bc restarts.
iiuc Bob and Carol forgets both HTLCs in channel #bc. Upon reconnection, Bob's counters will match Carol's.
-> How are HTLCs #1 and #2 sent to Carol ?

Bob can probably tell than its #ab channel has HTLCs going to #bc that have not been processed by its #bc channel yet, but this is probably more complex to implement than the 'retransmit'' model (where Bob's #bc channel would just resend unacked messages).

@rustyrussell
Copy link
Collaborator Author

rustyrussell commented Jun 1, 2017 via email

@sstone
Copy link
Collaborator

sstone commented Jun 1, 2017

Yes, the "rollback" option mimics the batch protocol: one batch "write" call, vs one write call per send with the "retransmit" option. Same amount of data to write in both cases, but with less calls (though it's probably much less of a win in the case of database calls).

@cdecker cdecker added this to the v1.0 milestone Jun 14, 2017
a redundant `funding_locked` if it receives one.

On reconnection, a node MUST transmit `channel_reestablish`
for each channel, and MUST wait for to receive the other node's
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and MUST wait for to receive

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On reconnection, a node MUST transmit channel_reestablish for each channel

Is it true even for not-yet-confirmed channels? Must a node send a channel_reestablish with both counters at zero?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Although I had to fix the definition of the revocations_received; it was defined to be the commit index of last-received revocation, which is undefined. So I added one, and added a revocations_received explanation paragraph.

@@ -958,8 +958,8 @@ for each channel, and MUST wait for to receive the other node's
`channel_reestablish` message before sending any other messages for
that channel. The sending node MUST set `commitments_received` to the
commitment number of the last `commitment_signed` it received, and
MUST set `revocations_received` to the commitment number of the last
`revoke_and_ack` message received.
MUST set `revocations_received` to one greater than the commitment number of the last
Copy link
Collaborator

@pm47 pm47 Jun 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about calling the fields next_commitment_number and next_revocation_number? It would solve the initialization problem, and would also fit nicely with the new phrasing in 3f9748b.

Edit: nevermind, the sentence confused me a little, but it seems revocations_received will actually always be equal to the number of distinct revoke_and_ack received, so it is fine as is.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually changed it to explicitly be the next indices, then read your updated comment before pushing. It makes some sense, and it's consistent between revocations and commitments: this means the commitment value is 1 greater than in the previous definition though (effectively still the total number received, but counting the signature from open/accept too).

@rustyrussell
Copy link
Collaborator Author

rustyrussell commented Jun 22, 2017 via email

fail the channel if they occur.
channel, and the following requirements do not apply.

On reconnection, a node MUST retransmit `funding_locked` unless it has
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about moving this paragraph below the next one, and rephrase it to:
"If next_local_commitment_number is zero in both the channel_reestablish it sent and received, then the node MUST retransmit funding_locked, otherwise it MUST NOT. On reconnection, a node MUST ignore a redundant funding_locked if it receives one."

Would that work? I find it more straightforward to understand and implement, and I think there is an ambiguity in case of discarded update_* in the current version.

Copy link
Collaborator Author

@rustyrussell rustyrussell Jun 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brilliant! Patched exactly like that. 1, not zero (commit 0 is created during opening).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, right

@rustyrussell rustyrussell force-pushed the retransmission-with-state-counters branch from 5f419c2 to f4d870f Compare June 27, 2017 01:25
@rustyrussell
Copy link
Collaborator Author

I squashed fixme commits, for final review phase.

@rustyrussell rustyrussell force-pushed the retransmission-with-state-counters branch from f4d870f to 1197db0 Compare June 27, 2017 01:56
Copy link
Collaborator

@pm47 pm47 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I was a bit reluctant with the "rollback" mechanism (A) described in this PR, because it seemed too complex for small gains. I was more in favor of a more basic "simple replay" mechanism (B), which I found more simple to grasp.

Having implemented the two variants:

  • (B) is indeed conceptually simpler, and easy to debug, but it is actually not easier to implement, because of the need to keep track of all messages. The acknowledgment logic can become quite convoluted (it was in my implementation at least), also filtering out old messages was a real pain and I suspect it opened up a few vulnerabilities
  • (B) is less performant because we need to write all updates to disk
  • the rollback logic of (A) is simpler than I thought initially, and with the latest iterations there is no corner case at initialization which is nice
  • with (A) I still have trouble wrapping my head around the fact that the order of messages is not preserved. For example revoke_and_ack, commitment_signed and shutdown messages can be re-sent in a different order. Even if it appears to work fine, there is still something magical about it. I guess I'll be more confortable with it once I add more test on edge cases.

All things considered I changed my mind and now prefer (A). Thanks for your work!

We can't do that, so allow "write, then send".  That fails on the side of
timing out, rather than having a channel which can't be used.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This adds a message for each channel reconnect (after we've
sent/received `funding_signed`, ie. when we rememeber the channel),
which says exactly how many `commitment_signed` and `revoke_and_ack`
we've received.  Really, we could use one bit for each (they could
only be missing the last one), but better to be clear.

This leaves the "rollback if didn't get commitment_signed"
requirement, but avoids any need to handle update duplicates or wonder
what update number a `commitment_signed` applies to after reconnect.

Many thanks to pm47 and roasbeef especially for constructive feedback
which made this far better than I originally had.

Closes: lightning#172
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
@rustyrussell rustyrussell force-pushed the retransmission-with-state-counters branch from 1197db0 to 7333105 Compare June 27, 2017 21:09
@rustyrussell rustyrussell merged commit ac8b830 into lightning:master Jun 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants