-
Notifications
You must be signed in to change notification settings - Fork 326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle expired client errors in workers #1664
Conversation
@@ -622,8 +637,16 @@ impl<ChainA: ChainHandle, ChainB: ChainHandle> Connection<ChainA, ChainB> { | |||
|
|||
match self.handshake_step(state) { | |||
Err(e) => { | |||
error!("failed {:?} with error {}", state, e); | |||
RetryResult::Retry(index) | |||
if e.is_expired_or_frozen_error() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We abort the handshake without retrying if the client has expired.
relayer/src/worker/connection.rs
Outdated
retry_with_index(retry_strategy::worker_default_strategy(), |index| { | ||
handshake_connection.step_event(event.clone(), index) | ||
}) | ||
.map_err(|e| TaskError::Fatal(RunError::retry(e)))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By returning TaskError::Fatal
, we abort the worker in case there is any error. This fixes the issue of the worker repeatedly try to do the connection handshake on an expired client.
The same test is now within the channel expiration test
I noticed the E2E with gaia v6 failing so I restarted that. @soareschen is there a particular (i.e., manual) recipe we can use to test the behavior of this work in practice? Or do we rely entirely on integration tests? |
The E2E tests seems to be flaky for a while now. For manual test, the best way is to look at Currently due to the way the event subscription is wired up, it is not possible to test the workers individually without more significant refactoring. Ideally, the test should allow spawning of the connection or channel worker directly, so that it can probe that the worker task terminated as expected. This would require more flexible creation of the event subscription without having to go through everything through the supervisor. |
// FIXME: This returns error if the update event contains expired client state. | ||
// This can happen even if the latest client state is unexpired, but the | ||
// update event is from earlier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should we do about this? Leave for future PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tried to reproduce the error with a simpler test and added it here as a manual test. I also made the misbehavior task always continue after initial check, so that it can still act on new client update events.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! I guess we can now remove the FIXME
.query_balance(&chains.node_a.wallets().user1().address(), &denom_a)?; | ||
|
||
assert_eq( | ||
"balance on wallet A should decrease", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that really the behavior we expect? Should we not check whether or not the client is expired before we proceed to the transfer? What happens to the funds withdrawn from wallet A otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test passes, so that is the current behavior of ibc-go. I think we could file an issue in ibc-go to fail an ibc transfer if the local client is expired or frozen.
On the other hand, a chain probably can't check or guarantee that a counter party client is unfrozen, since everything is asynchronous. So the check would still only handle the simple case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see! Yeah I think it might be worth filing an issue to discuss this further :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm wait actually it might be that big of a deal, since the IBC packets have timeout. So any such failed token transfer will eventually be refunded. We should also write integration tests for such cases in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test passes, so that is the current behavior of ibc-go. I think we could file an issue in ibc-go to fail an ibc transfer if the local client is expired or frozen.
I think ibc-go v1.0.0 has protections against this. What chain binary are we using, does it have ibc-go v1.0?
/* | ||
This test reproduce the error log when a misbehavior task is | ||
first started. The error arise when `detect_misbehaviour_and_submit_evidence` | ||
is called with `None`, and the initial headers are already expired. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adizere @romac I added this manual test to reproduce the error log from the misbehavior detection. The key is that when detect_misbehaviour_and_submit_evidence
is called with None
, and some update client events contain expired headers, then the error log is shown. Note however the returned result is still ValidClient
, which is weird.
|
||
// If the counterparty state is already Open but current state is TryOpen, | ||
// return anyway as the final step is to be done by the counterparty worker. | ||
(State::TryOpen, State::Open) => return Ok((None, Next::Abort)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a new exit condition for the connection/channel worker. This is required because if there is a race in the handshake step with an external party, such as bootstrap_connection
, the worker may otherwise get stuck in a loop and keep processing new block events even when the handshake has completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work!
* Improve spawning of supervisor worker tasks * Add test to reproduce client expiry error * Trying to make connection worker abort on error * Properly terminate connection worker when client is expired * Abort channel worker on client expired error * New issue found in handling client expiration error in workers * Fix mock test failure * Terminate packet worker when client is expired * Do not retry channel creation if client is expired * Abort connection and channel worker when handshake is completed * Use better names for worker tasks * Improve connection expiration test * Use better names for worker tasks * Add integration tests for connection and channel workers * Fix connection and channel workers * Fix typo * Reorder arguments in assert_eventually_succeed * Make task step runner return Next::Continue/Abort * Make init_connection/channel return initialized Connection/Channel * Refactor connection/channel established as assertions * Found a bug in connection handshake code * Fix incorrect ordering in restore_from_event * Automate packet worker * Log handshake step result as info * Remove connection expiration test The same test is now within the channel expiration test * Move client_expiration tests to non-manual * Try to tame misbehavior task error on expiration * Make handshake_step return task::Next instead of bool * Update comment instruction for running expiration tests * Slightly improve misbehavior task and add failure test * Slightly simplify misbehavior expiration test * Add changelog * Abort connection/channel worker if counterparty state is already Open
Closes: #1543
Description
Handle errors arise from expired client in connection workers.
I have written some manual integration tests to identify the issues and verify that the solution is working.
This PR is based on the work of #1656 to simplify the way tasks can abort gracefully.
Unfortunately there are still outstanding issues to be addressed, and the refactoring of #1656 does not seem to be sufficient to handle all case of errors.(This is now resolved)PR author checklist:
unclog
.docs/
).Reviewer checklist:
Files changed
in the GitHub PR explorer.