Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle expired client errors in workers #1664

Merged
merged 39 commits into from
Dec 23, 2021
Merged

Handle expired client errors in workers #1664

merged 39 commits into from
Dec 23, 2021

Conversation

soareschen
Copy link
Contributor

@soareschen soareschen commented Dec 9, 2021

Closes: #1543

Description

Handle errors arise from expired client in connection workers.

I have written some manual integration tests to identify the issues and verify that the solution is working.

This PR is based on the work of #1656 to simplify the way tasks can abort gracefully. Unfortunately there are still outstanding issues to be addressed, and the refactoring of #1656 does not seem to be sufficient to handle all case of errors. (This is now resolved)


PR author checklist:

  • Added changelog entry, using unclog.
  • Added tests: integration (for Hermes) or unit/mock tests (for modules).
  • Linked to GitHub issue.
  • Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

  • Reviewed Files changed in the GitHub PR explorer.
  • Manually tested (in case integration/unit/mock tests are absent).

@@ -622,8 +637,16 @@ impl<ChainA: ChainHandle, ChainB: ChainHandle> Connection<ChainA, ChainB> {

match self.handshake_step(state) {
Err(e) => {
error!("failed {:?} with error {}", state, e);
RetryResult::Retry(index)
if e.is_expired_or_frozen_error() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We abort the handshake without retrying if the client has expired.

retry_with_index(retry_strategy::worker_default_strategy(), |index| {
handshake_connection.step_event(event.clone(), index)
})
.map_err(|e| TaskError::Fatal(RunError::retry(e)))?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By returning TaskError::Fatal, we abort the worker in case there is any error. This fixes the issue of the worker repeatedly try to do the connection handshake on an expired client.

@adizere adizere self-requested a review December 14, 2021 14:18
Base automatically changed from soares/improve-spawn-supervisor-worker-tasks to master December 16, 2021 20:11
@soareschen soareschen marked this pull request as ready for review December 20, 2021 11:18
@adizere
Copy link
Member

adizere commented Dec 21, 2021

I noticed the E2E with gaia v6 failing so I restarted that.

@soareschen is there a particular (i.e., manual) recipe we can use to test the behavior of this work in practice? Or do we rely entirely on integration tests?

@soareschen
Copy link
Contributor Author

The E2E tests seems to be flaky for a while now.

For manual test, the best way is to look at tools/integration-test/src/tests/client_expiration.rs and insert suspend() at suitable places to observe the behavior at different steps. The key for this changes is to not see repeating error logs, i.e. the errors should only be displayed once and then the worker should abort.

Currently due to the way the event subscription is wired up, it is not possible to test the workers individually without more significant refactoring. Ideally, the test should allow spawning of the connection or channel worker directly, so that it can probe that the worker task terminated as expected. This would require more flexible creation of the event subscription without having to go through everything through the supervisor.

relayer/src/channel.rs Outdated Show resolved Hide resolved
relayer/src/channel.rs Outdated Show resolved Hide resolved
relayer/src/channel.rs Outdated Show resolved Hide resolved
relayer/src/connection.rs Outdated Show resolved Hide resolved
relayer/src/connection.rs Outdated Show resolved Hide resolved
relayer/src/connection.rs Outdated Show resolved Hide resolved
Comment on lines +1171 to +1173
// FIXME: This returns error if the update event contains expired client state.
// This can happen even if the latest client state is unexpired, but the
// update event is from earlier.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do about this? Leave for future PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tried to reproduce the error with a simpler test and added it here as a manual test. I also made the misbehavior task always continue after initial check, so that it can still act on new client update events.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! I guess we can now remove the FIXME

relayer/src/worker/channel.rs Show resolved Hide resolved
relayer/src/worker/packet.rs Show resolved Hide resolved
.query_balance(&chains.node_a.wallets().user1().address(), &denom_a)?;

assert_eq(
"balance on wallet A should decrease",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that really the behavior we expect? Should we not check whether or not the client is expired before we proceed to the transfer? What happens to the funds withdrawn from wallet A otherwise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test passes, so that is the current behavior of ibc-go. I think we could file an issue in ibc-go to fail an ibc transfer if the local client is expired or frozen.

On the other hand, a chain probably can't check or guarantee that a counter party client is unfrozen, since everything is asynchronous. So the check would still only handle the simple case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see! Yeah I think it might be worth filing an issue to discuss this further :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm wait actually it might be that big of a deal, since the IBC packets have timeout. So any such failed token transfer will eventually be refunded. We should also write integration tests for such cases in the future.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test passes, so that is the current behavior of ibc-go. I think we could file an issue in ibc-go to fail an ibc transfer if the local client is expired or frozen.

I think ibc-go v1.0.0 has protections against this. What chain binary are we using, does it have ibc-go v1.0?

/*
This test reproduce the error log when a misbehavior task is
first started. The error arise when `detect_misbehaviour_and_submit_evidence`
is called with `None`, and the initial headers are already expired.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adizere @romac I added this manual test to reproduce the error log from the misbehavior detection. The key is that when detect_misbehaviour_and_submit_evidence is called with None, and some update client events contain expired headers, then the error log is shown. Note however the returned result is still ValidClient, which is weird.

@adizere adizere mentioned this pull request Dec 23, 2021
10 tasks

// If the counterparty state is already Open but current state is TryOpen,
// return anyway as the final step is to be done by the counterparty worker.
(State::TryOpen, State::Open) => return Ok((None, Next::Abort)),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new exit condition for the connection/channel worker. This is required because if there is a race in the handshake step with an external party, such as bootstrap_connection, the worker may otherwise get stuck in a loop and keep processing new block events even when the handshake has completed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

Copy link
Member

@romac romac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work!

@romac romac merged commit 03d4716 into master Dec 23, 2021
@romac romac deleted the soares/client-expiry branch December 23, 2021 14:55
hu55a1n1 pushed a commit to hu55a1n1/hermes that referenced this pull request Sep 13, 2022
* Improve spawning of supervisor worker tasks

* Add test to reproduce client expiry error

* Trying to make connection worker abort on error

* Properly terminate connection worker when client is expired

* Abort channel worker on client expired error

* New issue found in handling client expiration error in workers

* Fix mock test failure

* Terminate packet worker when client is expired

* Do not retry channel creation if client is expired

* Abort connection and channel worker when handshake is completed

* Use better names for worker tasks

* Improve connection expiration test

* Use better names for worker tasks

* Add integration tests for connection and channel workers

* Fix connection and channel workers

* Fix typo

* Reorder arguments in assert_eventually_succeed

* Make task step runner return Next::Continue/Abort

* Make init_connection/channel return initialized Connection/Channel

* Refactor connection/channel established as assertions

* Found a bug in connection handshake code

* Fix incorrect ordering in restore_from_event

* Automate packet worker

* Log handshake step result as info

* Remove connection expiration test

The same test is now within the channel expiration test

* Move client_expiration tests to non-manual

* Try to tame misbehavior task error on expiration

* Make handshake_step return task::Next instead of bool

* Update comment instruction for running expiration tests

* Slightly improve misbehavior task and add failure test

* Slightly simplify misbehavior expiration test

* Add changelog

* Abort connection/channel worker if counterparty state is already Open
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Hermes should check client expiration
3 participants