Handle expired client errors in workers #1664

soareschen · 2021-12-09T21:33:13Z

Closes: #1543

Description

Handle errors arising from expired clients in connection workers.

I have written some manual integration tests to identify the issues and verify that the solution is working.

This PR is based on the work of #1656 to simplify the way tasks can abort gracefully. ~~Unfortunately there are still outstanding issues to be addressed, and the refactoring of #1656 does not seem to be sufficient to handle all cases of errors.~~ (This is now resolved)

PR author checklist:

Added changelog entry, using unclog.
Added tests: integration (for Hermes) or unit/mock tests (for modules).
Linked to GitHub issue.
Updated code comments and documentation (e.g., docs/).

Reviewer checklist:

Reviewed Files changed in the GitHub PR explorer.
Manually tested (in case integration/unit/mock tests are absent).

soareschen · 2021-12-09T21:35:08Z

relayer/src/connection.rs

@@ -622,8 +637,16 @@ impl<ChainA: ChainHandle, ChainB: ChainHandle> Connection<ChainA, ChainB> {

        match self.handshake_step(state) {
            Err(e) => {
-                error!("failed {:?} with error {}", state, e);
-                RetryResult::Retry(index)
+                if e.is_expired_or_frozen_error() {


We abort the handshake without retrying if the client has expired.
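As a rough illustration of the decision described above, here is a minimal sketch. Only `is_expired_or_frozen_error` is a name taken from the diff; `HandshakeError`, `RetryOutcome`, and `classify` are hypothetical stand-ins and not the relayer's actual types.

```rust
/// Hypothetical stand-in for the relayer's connection error type.
#[derive(Debug, Clone, PartialEq)]
enum HandshakeError {
    /// The client backing the connection has expired or has been frozen.
    ClientExpiredOrFrozen,
    /// Any other failure, for example a transient RPC error.
    Transient(String),
}

impl HandshakeError {
    /// Mirrors the method name used in the diff; the body is an assumption.
    fn is_expired_or_frozen_error(&self) -> bool {
        matches!(self, HandshakeError::ClientExpiredOrFrozen)
    }
}

/// Simplified retry outcome: try again at `index`, or give up with the error.
#[derive(Debug, PartialEq)]
enum RetryOutcome {
    Retry(u64),
    Abort(HandshakeError),
}

/// Retrying cannot help once the client is expired or frozen, so abort;
/// every other error keeps the previous retry behavior.
fn classify(err: HandshakeError, index: u64) -> RetryOutcome {
    if err.is_expired_or_frozen_error() {
        RetryOutcome::Abort(err)
    } else {
        RetryOutcome::Retry(index)
    }
}
```

With this shape, a transient error at retry index 3 yields `RetryOutcome::Retry(3)`, while an expired client stops the retry loop immediately.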

soareschen · 2021-12-09T21:37:02Z

relayer/src/worker/connection.rs

+                            retry_with_index(retry_strategy::worker_default_strategy(), |index| {
+                                handshake_connection.step_event(event.clone(), index)
+                            })
+                            .map_err(|e| TaskError::Fatal(RunError::retry(e)))?;


By returning TaskError::Fatal, we abort the worker if there is any error. This fixes the issue of the worker repeatedly trying to perform the connection handshake on an expired client.
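To make the difference between a fatal and a non-fatal task error concrete, here is a simplified sketch of a step-driven task loop. The `TaskError`, `Next`, and `run_task` definitions below are hypothetical reductions written for this comment; the real task runner in the relayer is more involved.

```rust
/// Hypothetical, simplified versions of the task types named in this PR.
#[derive(Debug, PartialEq)]
enum TaskError<E> {
    /// Record the error and keep running the task.
    Ignore(E),
    /// Record the error once and terminate the task.
    Fatal(E),
}

#[derive(Debug, PartialEq)]
enum Next {
    Continue,
    Abort,
}

/// Runs `step` up to `max_steps` times and returns every error it "logged".
/// A fatal error or `Next::Abort` ends the loop early.
fn run_task<E>(
    mut step: impl FnMut() -> Result<Next, TaskError<E>>,
    max_steps: usize,
) -> Vec<E> {
    let mut logged = Vec::new();
    for _ in 0..max_steps {
        match step() {
            Ok(Next::Continue) => {}
            Ok(Next::Abort) => break,
            Err(TaskError::Ignore(e)) => logged.push(e),
            Err(TaskError::Fatal(e)) => {
                logged.push(e);
                break;
            }
        }
    }
    logged
}
```

A step that always fails with `Fatal` produces exactly one logged error, whereas the same failure wrapped in `Ignore` is logged on every iteration, which is the repeated-error behavior this change is meant to remove.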

tools/integration-test/src/tests/manual/client_expiration.rs

The same test is now within the channel expiration test

adizere · 2021-12-21T12:49:02Z

I noticed the E2E with gaia v6 failing so I restarted that.

@soareschen is there a particular (i.e., manual) recipe we can use to test the behavior of this work in practice? Or do we rely entirely on integration tests?

soareschen · 2021-12-21T13:58:14Z

The E2E tests seem to have been flaky for a while now.

For manual testing, the best way is to look at tools/integration-test/src/tests/client_expiration.rs and insert suspend() at suitable places to observe the behavior at different steps. The key for this change is to not see repeating error logs, i.e. each error should be displayed only once, after which the worker should abort.

Currently, because of the way the event subscription is wired up, it is not possible to test the workers individually without more significant refactoring. Ideally, the test should be able to spawn the connection or channel worker directly, so that it can check that the worker task terminated as expected. This would require a more flexible way of creating the event subscription, without routing everything through the supervisor.
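For illustration only, a test that probes worker termination directly could look roughly like the sketch below. It uses plain `std::thread` rather than the relayer's task machinery, and both `eventually` and `spawn_failing_worker` are hypothetical helpers that do not exist in the test framework.

```rust
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical helper: polls `check` until it returns true or `timeout` elapses.
fn eventually(timeout: Duration, mut check: impl FnMut() -> bool) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(10));
    }
}

/// Stand-in for a worker that hits a fatal error on its first step and exits.
fn spawn_failing_worker() -> JoinHandle<Result<(), String>> {
    thread::spawn(|| Err("client expired".to_string()))
}
```

A test would then assert `eventually(Duration::from_secs(2), || handle.is_finished())` on the returned handle and inspect the result of `handle.join()`. `JoinHandle::is_finished` requires Rust 1.61 or later.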

relayer/src/channel.rs

relayer/src/connection.rs

romac · 2021-12-21T16:31:54Z

relayer/src/foreign_client.rs

+            // FIXME: This returns error if the update event contains expired client state.
+            // This can happen even if the latest client state is unexpired, but the
+            // update event is from earlier.


What should we do about this? Leave for future PR?

I have tried to reproduce the error with a simpler test and added it here as a manual test. I also made the misbehavior task always continue after the initial check, so that it can still act on new client update events.

Looks good to me! I guess we can now remove the FIXME

relayer/src/worker/channel.rs

relayer/src/worker/packet.rs

romac · 2021-12-21T16:45:03Z

tools/integration-test/src/tests/client_expiration.rs

+                .query_balance(&chains.node_a.wallets().user1().address(), &denom_a)?;
+
+            assert_eq(
+                "balance on wallet A should decrease",


Is that really the behavior we expect? Should we not check whether or not the client is expired before we proceed to the transfer? What happens to the funds withdrawn from wallet A otherwise?

The test passes, so that is the current behavior of ibc-go. I think we could file an issue in ibc-go to fail an ibc transfer if the local client is expired or frozen.

On the other hand, a chain probably can't check or guarantee that a counterparty client is unfrozen, since everything is asynchronous. So the check would still only handle the simple case.

I see! Yeah I think it might be worth filing an issue to discuss this further :)

Hmm wait, actually it might not be that big of a deal, since IBC packets have a timeout. So any such failed token transfer will eventually be refunded. We should also write integration tests for such cases in the future.

> The test passes, so that is the current behavior of ibc-go. I think we could file an issue in ibc-go to fail an ibc transfer if the local client is expired or frozen.

I think ibc-go v1.0.0 has protections against this. What chain binary are we using, does it have ibc-go v1.0?

tools/integration-test/src/tests/client_expiration.rs

soareschen · 2021-12-21T21:10:25Z

tools/integration-test/src/tests/client_expiration.rs

+        /*
+           This test reproduces the error log when a misbehavior task is
+           first started. The error arises when `detect_misbehaviour_and_submit_evidence`
+           is called with `None`, and the initial headers are already expired.


@adizere @romac I added this manual test to reproduce the error log from the misbehavior detection. The key is that when detect_misbehaviour_and_submit_evidence is called with None, and some update client events contain expired headers, then the error log is shown. Note however the returned result is still ValidClient, which is weird.

soareschen · 2021-12-23T13:54:42Z

relayer/src/channel.rs

+
+            // If the counterparty state is already Open but current state is TryOpen,
+            // return anyway as the final step is to be done by the counterparty worker.
+            (State::TryOpen, State::Open) => return Ok((None, Next::Abort)),


I added a new exit condition for the connection/channel worker. This is required because if there is a race in the handshake step with an external party, such as bootstrap_connection, the worker may otherwise get stuck in a loop and keep processing new block events even when the handshake has completed.
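A condensed sketch of this exit condition follows. The `State` and `Next` enums and the `next_step` function are simplified, hypothetical stand-ins, and the tuple order (local state first, counterparty state second) is an assumption based on the code comment in the diff.

```rust
/// Simplified handshake states; the real types carry more information.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Uninitialized,
    Init,
    TryOpen,
    Open,
}

#[derive(Debug, PartialEq)]
enum Next {
    Continue,
    Abort,
}

/// Decides whether the worker for the local side still has work to do.
fn next_step(local: State, counterparty: State) -> Next {
    match (local, counterparty) {
        // Both ends are open: the handshake is complete.
        (State::Open, State::Open) => Next::Abort,
        // The counterparty is already Open while we are TryOpen: the final
        // step belongs to the counterparty worker, so stop here as well.
        (State::TryOpen, State::Open) => Next::Abort,
        // Anything else means the handshake is still in progress.
        _ => Next::Continue,
    }
}
```

Without the `(TryOpen, Open)` arm, a worker that lost the race to an external party would fall through to `Next::Continue` and keep reacting to every new block event.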

Good catch!

romac

Great work!

* Improve spawning of supervisor worker tasks
* Add test to reproduce client expiry error
* Trying to make connection worker abort on error
* Properly terminate connection worker when client is expired
* Abort channel worker on client expired error
* New issue found in handling client expiration error in workers
* Fix mock test failure
* Terminate packet worker when client is expired
* Do not retry channel creation if client is expired
* Abort connection and channel worker when handshake is completed
* Use better names for worker tasks
* Improve connection expiration test
* Use better names for worker tasks
* Add integration tests for connection and channel workers
* Fix connection and channel workers
* Fix typo
* Reorder arguments in assert_eventually_succeed
* Make task step runner return Next::Continue/Abort
* Make init_connection/channel return initialized Connection/Channel
* Refactor connection/channel established as assertions
* Found a bug in connection handshake code
* Fix incorrect ordering in restore_from_event
* Automate packet worker
* Log handshake step result as info
* Remove connection expiration test. The same test is now within the channel expiration test
* Move client_expiration tests to non-manual
* Try to tame misbehavior task error on expiration
* Make handshake_step return task::Next instead of bool
* Update comment instruction for running expiration tests
* Slightly improve misbehavior task and add failure test
* Slightly simplify misbehavior expiration test
* Add changelog
* Abort connection/channel worker if counterparty state is already Open

soareschen and others added 6 commits December 6, 2021 19:56

Improve spawning of supervisor worker tasks

0ed7526

Add test to reproduce client expiry error

98740a2

Trying to make connection worker abort on error

569061c

Properly terminate connection worker when client is expired

d867fca

Abort channel worker on client expired error

4a3921f

New issue found in handling client expiration error in workers

cb68670

soareschen commented Dec 9, 2021

soareschen mentioned this pull request Dec 9, 2021

Improve spawning of supervisor worker tasks #1656

Merged

6 tasks

soareschen and others added 7 commits December 13, 2021 16:08

Merge branch 'master' into soares/client-expiry

8562ffd

Fix mock test failure

43b558c

Terminate packet worker when client is expired

8b69db3

Do not retry channel creation if client is expired

e8a6303

Abort connection and channel worker when handshake is completed

f2b11a4

Use better names for worker tasks

eb5d069

Improve connection expiration test

4716ae1

adizere self-requested a review December 14, 2021 14:18

soareschen and others added 9 commits December 15, 2021 19:38

Merge branch 'master' into soares/improve-spawn-supervisor-worker-tasks

b0ec483

Use better names for worker tasks

490b8df

Add integration tests for connection and channel workers

aad0e00

Merge branch 'soares/improve-spawn-supervisor-worker-tasks' into soar…

aa3a487

…es/client-expiry

Fix connection and channel workers

d9e2bf0

Fix typo

af16f5b

Reorder arguments in assert_eventually_succeed

e060b4d

Merge branch 'soares/improve-spawn-supervisor-worker-tasks' into soar…

03dbe08

…es/client-expiry

Make task step runner return Next::Continue/Abort

54c8588

Base automatically changed from soares/improve-spawn-supervisor-worker-tasks to master December 16, 2021 20:11

soareschen added 4 commits December 16, 2021 21:17

Make init_connection/channel return initialized Connection/Channel

fc4e4da

Merge remote-tracking branch 'origin/master' into soares/client-expiry

12b5513

Refactor connection/channel established as assertions

7996404

Found a bug in connection handshake code

3be2851

soareschen added 6 commits December 17, 2021 12:43

Fix incorrect ordering in restore_from_event

1f2be2b

Automate packet worker

cfbe00c

Log handshake step result as info

4625e54

Remove connection expiration test

4869758

The same test is now within the channel expiration test

Move client_expiration tests to non-manual

9158564

Try to tame misbehavior task error on expiration

cfffb6b

soareschen marked this pull request as ready for review December 20, 2021 11:18

soareschen requested review from ancazamfir and romac as code owners December 20, 2021 11:18

romac reviewed Dec 21, 2021

relayer/src/channel.rs (outdated, resolved)

romac reviewed Dec 21, 2021

soareschen added 3 commits December 21, 2021 20:58

Make handshake_step return task::Next instead of bool

d724164

Update comment instruction for running expiration tests

f396feb

Slightly improve misbehavior task and add failure test

3b06fe3

soareschen commented Dec 21, 2021

soareschen added 3 commits December 22, 2021 11:00

Slightly simplify misbehavior expiration test

fb659cb

Merge remote-tracking branch 'origin/master' into soares/client-expiry

4396868

Add changelog

2b75a1d

adizere mentioned this pull request Dec 23, 2021

Release ibc-rs v0.10.0 #1712

Merged

10 tasks

Abort connection/channel worker if counterparty state is already Open

e997825

soareschen commented Dec 23, 2021

romac approved these changes Dec 23, 2021

romac merged commit 03d4716 into master Dec 23, 2021

romac deleted the soares/client-expiry branch December 23, 2021 14:55

soareschen mentioned this pull request Jan 7, 2022

Hermes refresh for clients with small trusting period doesn't work #1563

Closed

5 tasks
