-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
routing: fix payment state and refactor payment lifecycle tests #5332
routing: fix payment state and refactor payment lifecycle tests #5332
Conversation
846a62c
to
e7ecb4e
Compare
Unfortunately no :( Stuff like https://reviewable.io/ helps do that, but we never managed to make the full switch over. |
e7ecb4e
to
13325b5
Compare
routing/payment_lifecycle.go
Outdated
p.state.Lock() | ||
|
||
// Block until the result is available. | ||
result, err := sh.collectResult(attempt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@halseth Suppose the payment state is,
terminate: false
remainingAmt: 0
numInflightShards: 2
There are two steps here,
- We fail the attempt inside the
collectResult
, this has the effect of marking one of thepayment.HTLCs
as failed, the payment state then becomes,terminate: false remainingAmt: 5000 numInflightShards: 1
- Then calls
handleSendError
at line 116, which will fail the payment, and the state becomes,terminate: true remainingAmt: 5000 numInflightShards: 1
Between step 1 and 2, other goroutines may act and change the state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the explanation! I'm still trying to understand why changing the state between step 1 and 2 would be a problem though. Payment attempt outcome and overall payment outcome should ideally be decoupled enough for that not to be a problem.
What could happen between 1 and 2 that would lead to unexpected behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for reading this!
Tbh I am too trying to understand its implications. The above case could lead to the situation where we would fire one more shard in the next round of the lifecycle
loop. Suppose this new one is settled, but it failed to update the payment in db, and the previous step 2 now finished, then we have a payment that's both settled and failed? (Don't know if it's possible tho, still needs to read more code on MMP part).
This is a case when I was refactoring the tests, data race came up because we were updating the payment in one goroutine, while reading it from another. Previously, whenever we construct the payment state, by calling payment.SentAmt()
, payment.TerminalInfo()
, etc, there is another goroutine modifying them.
I've also updated this PR, realized that there's a mutex used in control tower when we update the payment. That mutex can be used too when we read the payment state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the payment has both failed and succeed (which can happen if both happens simultaneously), the success takes precedence:
Lines 349 to 367 in 9163d1a
// Use the DB state to determine the status of the payment. | |
var paymentStatus PaymentStatus | |
switch { | |
// If any of the the HTLCs did succeed and there are no HTLCs in | |
// flight, the payment succeeded. | |
case !inflight && settled: | |
paymentStatus = StatusSucceeded | |
// If we have no in-flight HTLCs, and the payment failure is set, the | |
// payment is considered failed. | |
case !inflight && failureReason != nil: | |
paymentStatus = StatusFailed | |
// Otherwise it is still in flight. | |
default: | |
paymentStatus = StatusInFlight | |
} |
I see what you mean that firing a new shard is unnecessary after the attempt is failed, but before we have marked the payment failed, but it should be fine.
Maybe the data race you saw is a missing mutex within the mock used during the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Further looked into it and yeah, it's a data race within the test. But this line got my attention tho, should there be a lock during the following read?
Lines 188 to 192 in 9163d1a
func (p *controlTower) FetchPayment(paymentHash lntypes.Hash) ( | |
*channeldb.MPPayment, error) { | |
return p.db.FetchPayment(paymentHash) | |
} |
I've updated the PR again. If the inconsistency is not an issue, then yeah there's no need to acquire a lock. The extra shard will then resolve itself out in the next round of lifecycle
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the mutex is not needed here since it is only used when sending a notification.
2e2b65d
to
831602b
Compare
831602b
to
253a8bd
Compare
Needs a rebase! |
Will be looking to merge this as a whole instead of #5218 then this since this builds upon the base PR cleanly. |
The simulated error returned was rejected due to signature failure, and didn't simulate correctly the insufficient fees error as intended. Fix error by including correct signature.
This commit refactors some of the tests in router_test.go to use the require package.
This commit moves the handleSendError method from ChannelRouter to shardHandler. In doing so, shardHandler can now apply updates to the in-memory paymentSession if they are found in the error message.
This commit adds the method UpdateAdditionalEdge in PaymentSession, which allows the addtional channel edge policy to be updated from a ChannelUpdate message. Another method, GetAdditionalEdgePolicy is added to allow querying additional edge policies.
This commit adds payment session to shardHandler to enable private edge policies being updated in shardHandler. The relevant interface and mock are updated. From now on, upon seeing a ChannelUpdate message, shardHandler will first try to find the target policy in additionalEdges and update it. If nothing found, it will then check the database for edge policy to update.
e6a430a
to
33caf07
Compare
This commit renames the mock structs by appending Old in their names. In doing so the old tests stay unchanged and new mock structs can be added in the following commit.
This commit uses the package mock to create new mock structs, replacing the old ones for better control when writing tests.
This commit refactors the resumePayment to extract some logics back to paymentState so that the code is more testable. It also adds unit tests for paymentState, and breaks the original MPPayment tests into independent tests so that it's easier to maintain and debug. All the new tests are built using mock so that the control flow is eaiser to setup and change.
33caf07
to
cd35981
Compare
Done! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Solid work! We definitely have a bit more work ahead of us here to make this section of the codebase more scrutable, but this is certainly a step in the right direction. I like the new style you've introduced here in the newly added tests where we work up our series of assertions, then asynchronously kick off the function under test and assert that things lined up with our expectations.
One lingering question re an additional case that was added in the paymentState
struct, and if that may address some recently reported issues that have popped up.
LGTM 🍙
@@ -529,3 +530,182 @@ func (m *mockControlTowerOld) SubscribePayment(paymentHash lntypes.Hash) ( | |||
|
|||
return nil, errors.New("not implemented") | |||
} | |||
|
|||
type mockPaymentAttemptDispatcher struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice move re renaming the older versions to keep things in place and create a new base test abstraction for any future changes!
// If we have in flight shards and the payment is in final stage, we | ||
// need to wait for the outcomes from the shards. Or if we have no more | ||
// money to be sent, we need to wait for the already launched shards. | ||
if ps.numShardsInFlight == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this represent a bug fix as the existing switch statement didn't exactly account for this?
(still getting through the diff)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would say so. When there are no shards in flight, there's no need to wait for outcomes from shards. When a new payment is created, we have zero shards thus there are no outcomes to wait for. When we are in the middle of a MPP, zero numShardsInFlight
could mean three cases,
- all the HTLC attempts failed (
h.Failure != nil
). In this case,remainingAmt
would be the full payment amount,terminate
is false unless we have a "payment level" failure. - all the HTLC attempts settled (
h.Settle !=nil
). In this case,remainingAmt
would be 0, andterminate
is true. - a mix of settled and failed HTLC attempts coexist (bad).
remaingAmt
is not 0, butterminate
is true.
And we need to check this logic together with terminated()
, as we have the cases,
case currentState.terminated():
...
case currentState.needWaitForShards():
...
Now for case 2 and 3, because ps.terminate
is true and ps.numShardsInFlight == 0
, this means we would go to case currentState.terminated():
.
For case 1, if the payment failed, it would also go to case currentState.terminated():
, otherwise we would launch new shards.
That being said, the complexity shown here probably means we have some improvement to achieve. There are three states, ps.terminate == true
, ps.remainingAmt == 0
and ps.numShardsInFlight == 0
to control the workflow. In theory, we should have 8 conditions to be checked. That's a bit much. Underneath the states, we use payment and HTLC attempts to control them. This is the part that brings complexity too. We have a payment with multiple HTLC attempts (one-to-many). Each of the HTLC has states, settled, failed, or in-flight. A settled HTLC means the payment has state settled, a failed HTLC does not. A payment has its own failed state determined by a different set of logic. I think this could be the root cause of all the complexity and unclear abstraction found in the routing
package.
// updatePaymentState will fetch db for the payment to find the latest | ||
// information we need to act on every iteration of the payment loop and update | ||
// the paymentState. | ||
func (p *paymentLifecycle) updatePaymentState() (*channeldb.MPPayment, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updatePaymentState
makes it semes like we're mutating the state on disk. How about fetchPaymentState
instead?
Also some small grammatical changes to the comment:
updatePaymentState will query the db for the latest payment state information we need to act on every iteration of the payment loop and update the paymentState.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, now fixed!
// a terminal state (settled or permanent failure), while we | ||
// were pathfinding. We know we're in a terminal state here, | ||
// so we can continue and wait for our last shards to return. | ||
case err == channeldb.ErrPaymentTerminal: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we need to handle this terminal payment state any longer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually we do need. This was deleted when the previous change locked the payment state so we wouldn't hit this error. This is now put back.
case <-p.router.quit: | ||
case <-p.quit: | ||
} | ||
// Overwrite errToSend and return. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be worth mentioning here (to make the control flow clearer) that we're just setting the "return param" for the defer
above (handleResultErr
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated!
@@ -456,32 +493,18 @@ func (p *shardHandler) collectResultAsync(attempt *channeldb.HTLCAttemptInfo) { | |||
attempt.AttemptID, p.identifier, err) | |||
} | |||
|
|||
select { | |||
case p.shardErrors <- err: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice usage of the defer here to get rid of that duplication!
testCases := []struct { | ||
name string | ||
|
||
// Use the following three params, each is equivalent to a bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
var failedAttempt channeldb.HTLCAttempt | ||
controlTower.On("FailAttempt", | ||
identifier, mock.Anything, mock.Anything, | ||
).Return(&failedAttempt, nil).Run(func(args mock.Arguments) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting alternative to the way we typically do things (sort of create the mocks as normal struct and "re-implement" any logic not able to be mocked out cleanly). This route sort of clutters the call site in the tests (imo), but as you mentioned earlier lets us more directly assert the control flow based on our "ideal mental model".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This route sort of clutters the call site in the tests (imo), but as you mentioned earlier lets us more directly assert the control flow based on our "ideal mental model".
Agreed that there are unnessary (imo) side effects used in these mock functions. Ideally it should just be mock.On("method").Return(something)
. A lot of the extra lines are used to contruct a wanted payment state. If we could somehow mock currentState.terminated()
and currentState.needWaitForShards()
then the test would be much easier. I guess this will be a future improvement.
Travis runs look good other than the shutdown issue, and we're running into that reported timeout in one of the switch tests (unrelated to these changes), will land as is. |
Depends on #5218, only the
last 7 commitslast 4 commits are relevant.This PR fixes a potential issue that causes inconsistency in the
paymentState
. In particular, thepaymentState.terminate
might become inconsistent with the rest. A new mock library is added to assist the unit testing and be further applied in refactoring the old MPP-related tests.PS, is there a way to view the diff of a PR based on another PR on github?