Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contract court acting on confirmed chain events #1017

Merged

Conversation

halseth
Copy link
Contributor

@halseth halseth commented Apr 3, 2018

This PR changes the way the ChannelArbitrator state machine is triggered, by making the ChainEvents sent from the ChainWatcher only being sent if a commitment is confirmed on-chain. This is done to ensure that if we act on a commitment (force close, breach, cooperative close etc), then the commitment we act on will not be replaced by another, valid commitment in-chain.

ChannelArbitrator

The ChannelArbitrator's state machine is changed to accept any commitment being seen on-chain, regardless what state it is currently in. In particular, we would previously assume a local force close transaction always would be confirmed if we had successfully broadcasted it. Now we broadcast, and wait for any commitment to be accepted into a block before making any more assumptions.

OpenChannel

By having to wait for a commitment to appear in-chain before taking any further action, a new "pending close state" a channel can be in is introduced. We add a field BroadcastedCommitment to the OpenChannel struct, which keeps track of the intermediate state a channel can be in if the commitment has been broadcasted. This field is needed to make sure we don't try to forward payments over such a channel, as it might be our commitment that is broadcasted.

ChainWatcher

The ChainWatcher now only dispatches confirmed events, making all it's clients unaware of any unconfirmed spends.

Fixes #538
Fixes #880
Fixes #1112
Fixes #1071
Fixes #1091

Note: builds on top of #956 and #969, and the integration tests need Roasbeef/btcd#20 to avoid flakes.

TODO before merge:

@meshcollider meshcollider added state machine commitments Commitment transactions containing the state of the channel notifications labels Apr 3, 2018
@halseth halseth force-pushed the contract-court-without-on-chain branch 2 times, most recently from c0c6549 to f0829f9 Compare April 4, 2018 10:26
@Roasbeef
Copy link
Member

Roasbeef commented Apr 6, 2018

One integration tests hit what looks to be a deadlock.

The other instance ran into this error (which may or may not be new?):


    --- FAIL: TestLightningNetworkDaemon/test_multi-hop_htlc_local_chain_claim (21.98s)
    	lnd_test.go:69: Failed: (test multi-hop htlc local chain claim): exited with error: 
    		*errors.errorString transactions not found in mempool: wanted 3, found 2 txs in mempool

(or possibly is an incorrect test that was coded around a prior race condition?)

@halseth
Copy link
Contributor Author

halseth commented Apr 9, 2018

The "transactions not found in mempool" error should be fixed when rebased on btcd with the mempool fix. It is basically an existing race condition that my changes made more likely to occur.

The deadlock I'm not sure about, not seen before 🤔

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! This is definitely something that should have been properly implemented when lnd was created even. It should resolve a number of issues we've been running into in the integration tests, and also some weird states we've seen channels get into on mainnet.

Once this PR finally gets in, I'll make a follow up PR that modifies things to ensure we broadcast the commitment accordingly in the face of channel desynchronization.

@@ -1374,17 +1374,6 @@ func (c *ChannelArbitrator) channelAttendant(bestHeight int32,
CommitResolution: uniClosure.CommitResolution,
HtlcResolutions: *uniClosure.HtlcResolutions,
}
if contractRes.IsEmpty() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on the simplification here!

// StateCommitmentBroadcasted is a state that indicates that the
// attendant has broadcasted the commitment transaction, and is now
// waiting for it to confirm.
StateCommitmentBroadcasted ArbitratorState = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As is now, this'll break any dormant nodes that are in the StateContractClosed state. We didn't use iota here as we need the enums to be stable on disk. instead this should be added as the final state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1475,6 +1475,7 @@ func (c *ChannelArbitrator) channelAttendant(bestHeight int32,
log.Infof("ChannelArbitrator(%v): all "+
"contracts resolved, exiting",
c.cfg.ChanPoint)
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I think I've seen a few lingering chan arbitrators that never exited, I think this was the souce

@@ -307,6 +311,13 @@ func (c *chainWatcher) closeObserver(spendNtfn *chainntnfs.SpendEvent) {
&commitmentHash,
)
if isOurCommitment {
if err := c.dispatchLocalClose(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming suggestion: dispatchLocalForceClose.

if err != nil {
return err
}
// With the event processed, we'll now notify all subscribers of the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing new line above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

if err := net.DisconnectNodes(ctxt, net.Alice, net.Bob); err != nil {
t.Fatalf("unable to disconnect Bob's peer from Alice's: err %v", err)
var predErr error
err = lntest.WaitPredicate(func() bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why's this necessary now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we now don't consider the channel closed (adding the CloseSummary to the db) before the commitment is on-chain, we run into a race with the above force close. Alternative is to modify the DisconnectPeer rpc to allow disconnecting from a peer where the commitment has been broadcasted, but not confirmed yet?

}

// TODO(halseth): create new clients that should be registered after tx
// is in the mempool already, when btcd supports notifying on these.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btcd supports this now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue created to address this.

@@ -1551,6 +1596,7 @@ type ChannelCloseSummary struct {

// CloseType details exactly _how_ the channel was closed. Three
// closure types are possible: cooperative, force, and breach.
// TODO: migration logic.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No migration logic needed if we append the new close types to the end.

case <-c.cfg.ChainEvents.CooperativeClosure:
log.Infof("ChannelArbitrator(%v) closing due to co-op "+
"closure", c.cfg.ChanPoint)
return

// We have broadcasted our commitment, and it is now confirmed
// on-chain.
case closeSummary := <-c.cfg.ChainEvents.LocalUnilateralClosure:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would need to double check, but I think it's the case that for all of these events, we need to ensure we always pass in the block height of the event. Otherwise, we may end up writing the incorrect height hint to disk, which has ramifications for neutrino (possibly missing an event).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now using the spendingHeight when advancing state. Also changed the variable name to triggerHeight to make it more clear.

// }
}

func TestChannelArbitratorCooperativeClose(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reminder to at least finish this pending started set of tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests finished.

@Roasbeef
Copy link
Member

Also needs a rebase in light of the recent merges.

@halseth halseth force-pushed the contract-court-without-on-chain branch 6 times, most recently from 716d9f4 to 5f35d47 Compare April 12, 2018 13:28
@halseth
Copy link
Contributor Author

halseth commented Apr 12, 2018

Addressed everything except integration tests and chanarb tests.

// When the state machine reaches that state, we'll log
// out the set of resolutions.
stateCb := func(nextState ArbitratorState) error {
if nextState == StateContractClosed {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you invert this condition, you can save a level of indentation and prevent a lot of line-wrapping.

if nextState != StateContractClosed {
    return nil
}

Admittedly you'd have to have two "return nil"s then, but on net it still seems nicer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed! 👍

@halseth halseth force-pushed the contract-court-without-on-chain branch 2 times, most recently from c87c51f to 93e23f1 Compare April 17, 2018 07:13
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

This PR is getting veeeerrry close...

Moving to start to test this on some lives nodes to weed out any issues introduced and also test out the new RPC modifications.

One lingering thing w.r.t the PR is the incomplete unit tests. We can either block and wait for them to be completed (which will allow be to do some manual testing while they're being finalized), or go ahead with this PR and the follow up with another PR to add the unit tests. The other dependency to this PR (atm at least), is the change to fix the race condition between the brar and the chain watcher. Seems easiest to delete the commit which introduces the partial fix, in favor of going with the ultimate solution developed within the pending PR.

@@ -1406,7 +1406,7 @@ func (c *ChannelArbitrator) channelAttendant(bestHeight int32) {
// We'll now advance our state machine until it reaches
// a terminal state.
_, _, err := c.advanceState(
uint32(bestHeight),
uint32(uniClosure.SpendingHeight),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// trimmed. We'll need to wait for a CSV timeout before we can
// reclaim the funds.
commitRes := contractResolutions.CommitResolution
if commitRes != nil && commitRes.MaturityDelay > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so.

if dbChan.IsBorked {
peerLog.Warnf("ChannelPoint(%v) is borked, won't "+
"start.", chanPoint)
if dbChan.ChanStatus != channeldb.Default {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Not a blocker, but we may want to add a String() method to the ChanStatus variables so this logs a bit nicer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

channeldb/db.go Outdated
if channel.IsPending != pending {
continue
}
if (channel.ChanStatus != Default) != waitingClose {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surprised this actually compiles...

I think this would be clearer written as:

switch channel.ChanStatus {
   case Borked && !waitingClose:
      continue
   case CommitmentBroadcasted && !waitingClose:
      continue
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rewrote and added comments to make it more clear what is going on.


// The commitment transaction has not been confirmed, so we expect to
// see a maturity height and blocks til maturity of 0.
assertCommitmentMaturity(t, forceClose, 0, 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check was removed, but never replaced.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with the check above for RecoveredBalance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WaitingClose state doesn't have the commitment maturity property, as it is implicit 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for RecoveredBalance

@@ -251,6 +251,30 @@ func (c *chainWatcher) SubscribeChannelEvents(syncDispatch bool) *ChainEventSubs

if syncDispatch {
sub.ProcessACK = make(chan error, 1)

// If this client is syncDispatch, we cannot safely delete it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this commit need to be part of the wider set of commits?

Perhaps we should merge in the fix that you can @cfromknecht are working on before we merge this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept this commit for now, as it is needed to make the integration tests pass reliably. It should be reverted when #1104 is merged.

@halseth halseth force-pushed the contract-court-without-on-chain branch from 93e23f1 to f69bf6f Compare April 19, 2018 13:01
@halseth
Copy link
Contributor Author

halseth commented Apr 19, 2018

Comments addressed and rebased! PTAL 🙏

Copy link
Contributor

@cfromknecht cfromknecht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking really good here! The added unit tests seem comprehensive, and changes to the integration suite make sense. Nothing major from me, just a couple small things/questions :)

@@ -285,6 +285,38 @@ type ChannelCommitment struct {
// * lets just walk through
}

// ChannelStatus is used to indicate whether an OpenChannel is in the default
// usable state, or a state where it shouln't be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouln't -> shouldn't

// Borked indicates that the channel has entered an irreconcilable
// state, triggered by a state desynchronization or channel breach.
// Channels in this state should never be added to the htlc switch.
Borked ChannelStatus = 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we ever decide if we need to distinguish the case between realizing we are borked, or if we detect that they have been borked?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'll be a follow which will lead to a full data-loss protection implementation.

// closure types are possible: cooperative, force, and breach.
// CloseType details exactly _how_ the channel was closed. Four closure
// types are possible: cooperative, local force, remote force, and
// breach.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add FundingCancelled to this list?

Copy link
Contributor Author

@halseth halseth Apr 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we currently have a FundingCancelled state within the daemon? The state could be added to the list when that scenario is something we're aware of. We do, will add.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was it removed in this diff? Didn't pick up...or are we talking about merging some of the other ones?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.


return channel.CloseChannel(summary)
MarkCommitmentBroadcasted: func() error {
return channel.MarkCommitmentBroadcasted()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thinkk this closure is unnecessary, since they have the same signature


return NewChannelArbitrator(arbCfg, nil, testLog),
resolvedChan, cleanUp, nil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra newline


// As a new block has been connected to the end of the main
// chain, we'll check to see if we need to make any on-chain
// claims on behalf of the channel contract that we're
// arbitrating for.
chainActions := c.checkChainActions(uint32(bestHeight),
chainActions := c.checkChainActions(uint32(triggerHeight),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like triggerHeight is already a uint32, maybe conversion is unnecessary?

Roasbeef
Roasbeef previously approved these changes Apr 21, 2018
Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🔥

@Roasbeef
Copy link
Member

Needs rebase to resolve proto conflict due to recent merge.

halseth added 22 commits April 25, 2018 09:37
…t, advance from height of event not current height
…_watcher

This commit changes the channel arbitrator state machine to only care
about commitment transactions that are being confirmed on-chain
according to the chain_watcher. This is meant to handles the cases where
we would broadcast our commitment, expecting it to get confirmed, but
instead a competing transaction was confirmed.

This commit readies the ChannelArbitrator state machine for the change
that will make the ChainWatcher only notify on confirmed commitments.

The state machine has gotten a new state, StateCommitmentBroadcasted,
which we'll transition to after we have broadcasted our own commitment.
From this state we'll go to the StateContractClosed state regardless of
which commitment the ChainWatcher notifies about, unifying the contract
resolution betweee the local and remote force close.
…enChannel

This commit changes the bool `IsBorked` in OpenChannel to a `ChanStatus`
struct, of type ChannelStatus. This is used to indicated that a channel
that is technically still open, is either borked, or has had a
commitment broadcasted, but is not confirmed on-chain yet.

The ChannelStatus type has the value 1 for the status Borked, meaning it
is backwards compatible with the old database format.
…d when detected

This commit changes the ChainWatcher to only send a chain event in case
the various spends are _confirmed_ on-chain, not only seen on the
network.

A consequence of this is that we now give the ChainWatcher the
responsibility of marking the channel closed when the closing tx is
confirmed, instead of the ChannelArbitrator.
This commit adds MVP unit tests for the following scenarios in the
ChannelArbitrator:
1) A cooperative close is confirmed.
2) A remote force close is confirmed.
3) A local force close is requested and confirmed.
4) A local force close is requested, but a remote force close gets
confirmed.
This commit adds a new method FetchWaitingCloseChannels to the database,
used for fetching OpenChannels that have a ChanStatus != Default. These
are channels that are borked, or have had a commitment broadcasted, and
is now waiting for it to confirm.

The fetchChannels method is rewritten to return channels exclusively
based on wheter they are pending or waitingClose.
This commit changes from using FetchAllChannels to FetchAllOpenChannels,
making the check for whether a channel is pending unnecessary.
 This commit modifies the integration tests to work with the recent
 changes to the ChannelArbitrator, where it will only act on commitments
 that has been confirmed. Main changes involving when to look for
 transactions in the mempool and in blocks, and using the new RPC for
 getting channels in the "waiting close" phase when they are waiting for
 the commitment to confirm.
…cel()

This commit makes clients subscribing to channel events that are marked
"sync dispatch" _not_ being deleted from the list of clients when they
call Cancel(). Instead a go routine will be launched that will send an
error on every read of the ProcessACK channel.

This fixes a race in handing off the breach info while lnd was shutting
down. The breach arbiter could end up being shut down (and calling
Cancel()) before while the ChainWatcher was in the process of
dispatching a breach. Since the breach arbiter no longer was among the
registered clients at this point, the ChainWatcher would assume the
breach was handed off successfully, and mark the channel as pending
closed. When lnd now was restarted, the breach arbiter would not know
about the breach, and the ChainWatcher wouldn't attempt to re-dispatch,
as it was already marked as pending closed.
@halseth halseth force-pushed the contract-court-without-on-chain branch from 9e27499 to 4320421 Compare April 25, 2018 07:37
@halseth
Copy link
Contributor Author

halseth commented Apr 25, 2018

Decided to keep as is. Addressed last comment and rebased 💯

Copy link
Member

@Roasbeef Roasbeef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🌟

@Roasbeef Roasbeef merged commit 86fd9e3 into lightningnetwork:master Apr 26, 2018
@halseth halseth deleted the contract-court-without-on-chain branch July 10, 2018 10:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
commitments Commitment transactions containing the state of the channel notifications
Projects
None yet
5 participants