Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Robustness of Inbound MPP Claims Across Restart #1434

Conversation

TheBlueMatt
Copy link
Collaborator

This rewrites our claim pipeline to be substantially more robust against particularly poorly timed crashes, including the potentially partial-MPP-claim-causing bug described in Ensure all HTLCs for a claimed payment are claimed on startup!

It also adds some belt-and-suspenders checks when claiming and removes some API bits that I find somewhat confusing and likely hard to use (without replacement, so hopefully users can live without it, otherwise we'll have to revert the last commit and do some really careful documentation).

@valentinewallace
Copy link
Contributor

Wanted to throw out the idea of breaking this PR into the first 3 commits and the last 4 (or so)? I think people have mentioned preferring smaller PRs when feasible

@TheBlueMatt
Copy link
Collaborator Author

Pulled the first two commits into #1435, the rest are all really just "update claim_funds/fail_htlc and then some tests".

@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch 2 times, most recently from 721c415 to 49fe7c9 Compare April 20, 2022 19:33
@TheBlueMatt
Copy link
Collaborator Author

Remaining issue: (a) claim payment, (b) crash, (c) restart see PaymentReceived + PaymentClaimed events still in processing queue but payment has been claimed (on-chain), (d) sender sends again (e) call claim_funds in response to the dup PaymentRecived claiming the payment duplicatively. Not a big deal but should fix it by dropping the PaymentReceived event from the event queue on startup if we do a re-claim.

@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch from 49fe7c9 to 4e24f02 Compare April 21, 2022 20:54
lightning/src/chain/channelmonitor.rs Outdated Show resolved Hide resolved
payment_preimage: Some(payment_preimage),
}
});
}
match self.claim_funds_from_hop(channel_state.as_mut().unwrap(), htlc.prev_hop, payment_preimage) {
ClaimFundsFromHop::MonitorUpdateFail(pk, err, _) => {
if let msgs::ErrorAction::IgnoreError = err.err.action {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we get a permanent failure here (but claimed_any_htlcs is true), will the permfailed monitor be able to claim the htlc on-chain? Just concerned it won't have learned the preimage in this case

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I mean there's no good answer here (and this PR doesn't change that). TBH I think we need to rethink the whole perm-failure logic a bit. That's kinda the intent behind #1106, but ultimately the "what does perm-failure mean" topic isn't super well-defined currently. Lets leave this for a followup because there's nothing changed here, but, does perm-failure mean "we updated the in-memory copy and nothing else" or "we failed to update anything" needs addressing.

lightning/src/util/events.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Show resolved Hide resolved
lightning/src/ln/channel.rs Outdated Show resolved Hide resolved
lightning/src/chain/channelmonitor.rs Show resolved Hide resolved
if let Some(channel) = by_id.get_mut(&previous_channel_id) {
channel.claim_htlc_while_disconnected_dropping_mon_update(claimable_htlc.prev_hop.htlc_id, payment_preimage, &args.logger);
}
if let Some(previous_hop_monitor) = args.channel_monitors.get(&claimable_htlc.prev_hop.outpoint) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not worth checking that this is the same monitor as from L6781, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My line numbers aren't matching up, but I dont think there's any harm in providing extra preimages for monitors in this super-rare case. They'll get pruned anyway, but even if not it wouldn't matter.


nodes[0].node.handle_update_fulfill_htlc(&nodes[2].node.get_our_node_id(), &cs_updates.update_fulfill_htlcs[0]);
commitment_signed_dance!(nodes[0], nodes[2], cs_updates.commitment_signed, false, true);
expect_payment_sent!(nodes[0], payment_preimage);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kinda throwing me off that nodes[3] is not claiming the payment for the persisted monitor backwards, only the crashed monitor. To check my understanding, the only difference would be that there would be no PaymentSent generated here, it'd only be generated for the first successful backwards claim, right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, just on-chain. I didn't bother writing the on-chain eforcement checks cause they're a pain and we have lots of coverage there already.

@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch from 4e24f02 to 3c8c013 Compare April 22, 2022 18:06
Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went through each commit, largely looks good.

///
/// # Note
/// LDK will not stop an inbound payment from being paid multiple times, so multiple
/// `PaymentReceived` events may be generated for the same payment.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/PaymentReceived/PaymentClaimed iiuc

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that's confusing, I did mean to leave it as-is but added a second sentence.

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/functional_tests.rs Outdated Show resolved Hide resolved
@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch from 3c8c013 to 6c6d939 Compare April 25, 2022 16:41
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
for (_, monitor) in args.channel_monitors.iter() {
for (payment_hash, payment_preimage) in monitor.get_stored_preimages() {
if let Some(claimable_htlcs) = claimable_htlcs.remove(&payment_hash) {
log_info!(args.logger, "Re-claimaing HTLCs with payment hash {} due to partial-claim.", log_bytes!(payment_hash.0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Re-claiming

It took me a second to figure out how we know this is a re-claim, and definitely not a new claim. Is it because the monitor knows the preimage yet the htlc is still in the claimable_htlcs map, so the monitor must be ahead of the ChannelManager?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, yea. Technically its possible its a duplicate payment and the previous payment was claimed but not fully revoked before restart, but its not really worth checking for that case IMO. I updated the log line a bit.

lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
lightning/src/ln/channelmanager.rs Outdated Show resolved Hide resolved
return;
}
if claimable_amt_msat != expected_amt_msat.unwrap() {
log_info!(self.logger, "Attempted to claim an incomplete payment, expected {} msat, had {} available to claim.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the API be more ergonomic if we returned a Result<(), Error> from this method, for cases like this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like that sends the wrong impression - an Ok(()) does not imply that the payment is fully claimed, which is ultimately why the bool was removed from the return value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we expect these errors to be super rare? They do seem useful. Returning an Option<Error> is weird though

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, basically an error should only occur in a strange restart case of if a user took several blocks to try to claim a payment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like that sends the wrong impression - an Ok(()) does not imply that the payment is fully claimed, which is ultimately why the bool was removed from the return value.

Wouldn't we want to return the Err variant in this case? User attempts to claim a payment that is no longer claimable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I feel like that's confusing - you call claim_funds and get an Err, a user would reasonably assume that the payment was not claimed/is still pending, but in fact it could have been claimed!

@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch 2 times, most recently from b8c99f5 to bf9ec6d Compare April 28, 2022 02:46
@TheBlueMatt
Copy link
Collaborator Author

Rebased after merge of #1435.

Copy link
Contributor

@valentinewallace valentinewallace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ACK after CI and another reviewer

lightning/src/ln/functional_tests.rs Outdated Show resolved Hide resolved
return;
}
if claimable_amt_msat != expected_amt_msat.unwrap() {
log_info!(self.logger, "Attempted to claim an incomplete payment, expected {} msat, had {} available to claim.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we expect these errors to be super rare? They do seem useful. Returning an Option<Error> is weird though

@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch 2 times, most recently from 43907a8 to 48e5842 Compare May 4, 2022 18:28
@TheBlueMatt
Copy link
Collaborator Author

Ended up having to rewrite this to some extent as we're moving towards removing the FinalOnionHopData from OnionPayload::Invoice which this was relying on. This now starts with two opening commits which extend that migration and adds another field tracking in the claimable HTLCs. See the commit messages on those for more info.

@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch from 48e5842 to b4cf258 Compare May 5, 2022 19:24
@codecov-commenter
Copy link

codecov-commenter commented May 5, 2022

Codecov Report

Merging #1434 (531d6c8) into main (36817e0) will increase coverage by 1.19%.
The diff coverage is 88.71%.

@@            Coverage Diff             @@
##             main    #1434      +/-   ##
==========================================
+ Coverage   90.90%   92.10%   +1.19%     
==========================================
  Files          77       80       +3     
  Lines       42201    53330   +11129     
  Branches    42201    53330   +11129     
==========================================
+ Hits        38364    49120   +10756     
- Misses       3837     4210     +373     
Impacted Files Coverage Δ
lightning/src/util/events.rs 41.22% <26.47%> (+7.32%) ⬆️
lightning/src/ln/channelmanager.rs 87.92% <75.16%> (+3.25%) ⬆️
lightning/src/ln/functional_test_utils.rs 97.14% <96.66%> (+1.59%) ⬆️
lightning/src/ln/functional_tests.rs 98.30% <98.91%> (+1.18%) ⬆️
lightning/src/chain/chainmonitor.rs 97.96% <100.00%> (+0.04%) ⬆️
lightning/src/chain/channelmonitor.rs 92.02% <100.00%> (+0.96%) ⬆️
lightning/src/ln/chanmon_update_fail_tests.rs 97.77% <100.00%> (+0.01%) ⬆️
lightning/src/ln/channel.rs 91.70% <100.00%> (+3.12%) ⬆️
lightning/src/ln/monitor_tests.rs 100.00% <100.00%> (ø)
lightning/src/ln/onion_route_tests.rs 97.51% <100.00%> (ø)
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 36817e0...531d6c8. Read the comment docs.

@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch from f9515c4 to 7d4faf6 Compare May 25, 2022 23:40
@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch 2 times, most recently from 0fdabf5 to fb7f2b6 Compare May 26, 2022 00:27
While the HTLC-claim process happens across all MPP parts under one
lock, this doesn't imply that they are claimed fully atomically on
disk. Ultimately, an application can crash after persisting one
`ChannelMonitorUpdate` out of multiple monitor updates needed for
the full claim.

Previously, this would leave us in a very bad state - because of
the all-channels-available check in `claim_funds` we'd refuse to
claim the payment again on restart (even though the
`PaymentReceived` event will be passed to the user again), and we'd
end up having partially claimed the payment!

The fix for the consistency part of this issue is pretty
straightforward - just check for this condition on startup and
complete the claim across all channels/`ChannelMonitor`s if we
detect it.

This still leaves us in a confused state from the perspective of
the user, however - we've actually claimed a payment but when they
call `claim_funds` we return `false` indicating it could not be
claimed.
@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch from fb7f2b6 to 319df3a Compare May 26, 2022 00:53
@TheBlueMatt
Copy link
Collaborator Author

Squashed without changes:

$ git diff-tree -U1 fb7f2b61 319df3ac
$

@@ -6827,7 +6839,7 @@ impl<'a, Signer: Sign, M: Deref, T: Deref, K: Deref, F: Deref, L: Deref>
for (_, monitor) in args.channel_monitors.iter() {
for (payment_hash, payment_preimage) in monitor.get_stored_preimages() {
if let Some(claimable_htlcs) = claimable_htlcs.remove(&payment_hash) {
log_info!(args.logger, "Re-claimaing HTLCs with payment hash {} due to partial-claim.", log_bytes!(payment_hash.0));
log_info!(args.logger, "Re-claiming HTLCs with payment hash {} as we've released the preimage to a ChannelMonitor!", log_bytes!(payment_hash.0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is in the wrong commit but wtv

This replaces the return value of `claim_funds` with an event. It
does not yet change behavior in any material way.
If we crashed during a payment claim and then detected a partial
claim on restart, we should ensure the user is aware that the
payment has been claimed. We do so here by using the new
partial-claim detection logic to create a `PaymentClaimed` event.
As additional sanity checks, before claiming a payment, we check
that we have the full amount available in `claimable_htlcs` that
the payment should be for. Concretely, this prevents one
somewhat-absurd edge case where a user may receive an MPP payment,
wait many *blocks* before claiming it, allowing us to fail the
pending HTLCs and the sender to retry some subset of the payment
before we go to claim. More generally, this is just good
belt-and-suspenders against any edge cases we may have missed.
`ChannelManager::fail_htlc_backwards`' bool return value is quite
confusing - just because it returns false doesn't mean the payment
wasn't (already) failed. Worse, in some race cases around shutdown
where a payment was claimed before an unclean shutdown and then
retried on startup, `fail_htlc_backwards` could return true even
though (a duplicate copy of the same payment) was claimed, but the
claim event has not been seen by the user yet.

While its possible to use it correctly, its somewhat confusing to
have a return value at all, and definitely lends itself to misuse.

Instead, we should push users towards a model where they don't care
if `fail_htlc_backwards` succeeds - either they've locally marked
the payment as failed (prior to seeing any `PaymentReceived`
events) and will fail any attempts to pay it, or they have not and
the payment is still receivable until its timeout time is reached.

We can revisit this decision based on user feedback, but will need
to very carefully document the potential failure modes here if we
do.
@TheBlueMatt TheBlueMatt force-pushed the 2022-04-robust-payment-claims branch from cc25213 to 8a56702 Compare May 28, 2022 00:02
@TheBlueMatt
Copy link
Collaborator Author

Squashed, only change since yesterday was taking @jkczyz's suggested wording from #1434 (comment)

@TheBlueMatt
Copy link
Collaborator Author

Oops lol, duh, force-pushed the last commit to actually change the references to amt not just the definitions.

@valentinewallace valentinewallace merged commit a534a5e into lightningdevkit:main May 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants