
Fine grained error handling #950

Closed

Conversation

soareschen
Contributor

Partially Fixes: #712

Description

This is an initial refactoring that I have done to try and tame the error handling logic in the relayer. There are some design decisions which I'll try to explain here. We could also create an ADR to describe the design.

Error Constructors

Currently we are using anomaly to manage nested errors and backtraces. The abstraction provided by anomaly allows us to easily capture backtraces, together with an error context that holds the previous dynamic errors which caused the error of interest. While the design of anomaly is good enough for its intended purpose, some issues arise from its use.

In particular, the dynamic nature of anomaly makes it very easy to raise errors from any source error type. This makes it difficult for us to statically inspect what kind of underlying errors can cause a particular error to be raised. As an example, the Grpc error is raised whenever any error occurs while making a gRPC call, whether it is caused by connection issues, decoding errors, or application errors.

To fix this issue, we want to use static types to help enforce what kind of underlying errors can cause the current error to be raised. As an example, we may want to split the Grpc error into sub-errors such as GrpcStatusError when an invalid response status is returned, and GrpcTransportError when there is an error in the underlying transport. Furthermore, we want to use the type system to ensure that error causes cannot be mixed, i.e. we want to prevent the code from accidentally raising a GrpcStatusError when the underlying cause is actually a transport error.

To enforce this, we could try to modify the constructor for each sub-error to take the underlying cause as an argument. However, doing so would break the abstraction that anomaly has given us, since we would then have to manually expose the error context and backtraces. We could alternatively consider other error handling libraries, or come up with our own error handling abstraction. However, doing so would require significant effort to refactor the entire code base away from anomaly.

Instead, a more pragmatic approach is to retrofit our requirement on top of anomaly, so that we can slowly migrate to the new error handling approach. In our proposed solution, we continue to use the Kind pattern to classify the errors, and use anomaly to propagate the underlying errors. However, we refactor the code such that all constructors of the Kind error are kept private, and expose error constructor functions to construct errors instead.

The main idea is that with a constructor function, the concrete type of the underlying error must also be provided alongside. Inside the constructor function, we proceed as usual and type-erase the underlying error using anomaly::Context. But from the outside, the only way to construct a specific error is by giving the concrete error value.

As an example, in this PR we have the error constructor function fn grpc_transport_error(err: tonic::transport::Error) -> Error, which allows the application to raise a GrpcTransportError only when a tonic::transport::Error happens.
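A rough sketch of what such a constructor looks like (the variant name and the exact Context call here are illustrative, following the existing Context::new(err, None).into() pattern in the code):

use anomaly::Context;

// Sketch only: the caller must supply the concrete tonic error; only inside
// the constructor is it type-erased into the anomaly error context.
pub fn grpc_transport_error(err: tonic::transport::Error) -> Error {
    Context::new(Kind::GrpcTransportError, Some(err.into())).into()
}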

RetryableError

Another abstraction I try to introduce is the RetryableError trait. This trait abstracts away the error details, and tells us whether an operation should be retried when a given error happens. By implementing this trait on the error types of interest, we can decouple the retry operations from the logic of determining whether an error should be retried.

Currently, as the possible errors are under-specified, I leave a default implementation that treats all sub-errors in an error type as retryable. We can refine the implementation gradually as we gain a better understanding of each specific error.
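A minimal sketch of the trait with such a default (the method name here is an assumption and may differ from the actual code):

/// When encountering an error, indicates whether the relayer should
/// perform retry on the same operation.
pub trait RetryableError {
    fn is_retryable(&self) -> bool {
        // Default: until the error domains are better specified, treat
        // every sub-error as retryable.
        true
    }
}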

Case Splitting Underlying Errors

With our proposed approach, all underlying errors are still type-erased by anomaly into BoxError. However, when implementing RetryableError, we may encounter cases where we need to peek into an underlying sub-error to determine whether the error is retryable. In such cases, there is a simpler solution: inspect the underlying error during construction, and classify it into different sub-errors.

As an example, consider the error constructor fn tendermint_rpc_error(err: tendermint_rpc::Error) -> Error. We may want to decide whether the outer error is retryable depending on the error code returned by Tendermint RPC. In that case, we would inspect the error code inside tendermint_rpc_error, and construct one of the sub-errors RecoverableTendermintError or UnrecoverableTendermintError.
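A hedged sketch of that case split (the is_recoverable helper is a placeholder for whatever inspection of the Tendermint RPC error code we settle on):

// Placeholder: inspect the RPC error/status code here.
fn is_recoverable(_err: &tendermint_rpc::Error) -> bool {
    true
}

pub fn tendermint_rpc_error(err: tendermint_rpc::Error) -> Error {
    let kind = if is_recoverable(&err) {
        Kind::RecoverableTendermintError
    } else {
        Kind::UnrecoverableTendermintError
    };
    Context::new(kind, Some(err.into())).into()
}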

Different Roles of Error Constructs

In summary, we try to identify the different roles of the error constructs, and see how different abstractions can handle each role separately.

Error Context and Backtraces

This is for debugging purposes, when an error is raised and we want to find out how it was propagated. This is handled well by anomaly.

Error Construction

We want to construct errors such that a higher-level sub-error is caused by a specific lower-level error type. This allows errors to be properly classified and avoids cases where dynamic errors are misclassified. The application code is only concerned with constructing (throwing) errors and does not care about the internal structure.

The error constructor function allows us to address this issue.

Error Display

When an error is raised, we want to inform the user with as much information as is available. This is handled by defining separate sub-errors in an error enum and using the #[error()] attribute from thiserror to format the different error messages.
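For illustration, this is the same thiserror pattern already used in the Kind enums (the variant names here are examples only):

#[derive(Clone, Debug, thiserror::Error)]
pub enum Kind {
    #[error("invalid response status returned by the gRPC endpoint")]
    GrpcStatusError,

    #[error("error in the underlying gRPC transport")]
    GrpcTransportError,
}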

Error Recovery

A higher-level application may want to recover from errors raised by a lower-level application by inspecting the internal enum structure. In this case, each sub-error in the enum may carry additional information on whether an error is recoverable.

In practice, we are not doing much error recovery based on specific sub-errors. As a result, our classification of sub-errors in the error enums is used more for display and informational purposes.

Error Retry

A concept related to error recovery is retrying an operation when errors are raised. This is different from the error recovery mechanism, as here the application is only concerned with whether it should retry the operation. In contrast, error recovery typically needs to understand more details about the internal organization of the lower-level errors. In this PR, we introduce the RetryableError trait, which encapsulates the information on whether a sub-error is retryable.
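A sketch of how a retry loop could consume the trait, using the retry crate directly (this PR re-exports the same items); the is_retryable method name and the retry strategy below are assumptions:

use retry::{delay::Fixed, retry_with_index, OperationResult as RetryResult};

fn with_retry<T, E, F>(mut op: F) -> Result<T, retry::Error<E>>
where
    E: RetryableError,
    F: FnMut() -> Result<T, E>,
{
    retry_with_index(Fixed::from_millis(200).take(5), |_attempt| match op() {
        Ok(value) => RetryResult::Ok(value),
        // Retry only when the error itself says the operation is retryable.
        Err(e) if e.is_retryable() => RetryResult::Retry(e),
        Err(e) => RetryResult::Err(e),
    })
}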


For contributor use:

  • Updated the Unreleased section of CHANGELOG.md with the issue.
  • If applicable: Unit tests written, added test to CI.
  • Linked to Github issue with discussion and accepted design OR link to spec that describes this work.
  • Updated relevant documentation (docs/) and code comments.
  • Re-reviewed Files changed in the Github PR explorer.

@adizere
Member

adizere commented May 17, 2021

Great start! I forgot to mention this during our last chat: can you please have a look over issue #68, especially the comments. We also made an initial effort to fix the dynamic errors problem in PR #493, which we abandoned because it was not a priority and was too disruptive. I think you captured most of the considerations from that issue & the abandoned PR, but looking through them might still help provide a bit more context on this topic.

@romac
Member

romac commented May 17, 2021

Note that, as part of the effort towards no_std compatibility for the ibc crate, we are planning to move away from anomaly and thiserror in favor of displaydoc + manual std::error::Error: #951

@soareschen
Contributor Author

can you please have a look over issue #68, especially the comments.

Thanks for the pointer! Yes, I agree with some of the arguments made in #68, namely that we should prefer static errors over dynamic errors wrapped in eyre::Report. I feel that Rust's approach of defining error types explicitly helps the reader better understand the application behavior. I would advise against the dynamic error approach, especially stringly-typed errors like in Go, based on the arguments I laid out earlier.

Note that, as part of the effort towards no_std compatibility for the ibc crate, we are planning to move away from anomaly and thiserror in favor of displaydoc + manual std::error::Error: #951

Good to know. I think it's fine to use displaydoc instead of thiserror. But I'm curious how we plan to capture and propagate stack traces with a manual implementation of std::error::Error?

Also, moving entirely away from anomaly is probably going to take a while. Do you think it is better to first classify the errors in multiple smaller PRs, or should we refactor everything in one go?

@romac
Member

romac commented May 17, 2021

But I'm curious how we plan to capture and propagate stack traces with a manual implementation of std::error::Error?

It's not yet clear to me how to recover backtraces in the ibc crate, even when the std feature is enabled.

As for the relayer and relayer-cli crates, we will likely move from anomaly to eyre once abscissa does so, and should just retain backtraces at this level. Perhaps this is enough, and we won't lose out too much by not having backtraces in ibc.

What do you think?

@soareschen
Contributor Author

soareschen commented May 17, 2021

As for the relayer and relayer-cli crates, we will likely move from anomaly to eyre once abscissa does so, and should just retain backtraces at this level.

By this do you mean that every function in relayer and relayer-cli will return Result<T, eyre::Report>? That seems to be the opposite direction of what I'm trying to do in this PR, which is to make the possible types of errors more concrete. It might be fine to do it in relayer-cli, but the logic in relayer is pretty complex, and I'm not sure a unitype classification of errors will help there.

Member

@romac left a comment


Thanks @soareschen! Overall I really like these changes, especially the RetryableError trait. I left a few suggestions for dropping some suffixes and moving some methods into the Kind impl.

Now that I think of it, perhaps we could move them to an impl Error {} block, where Error is type Error = anomaly::Error<Kind>, instead of within Kind itself?

@@ -30,6 +37,9 @@ pub enum Kind {
#[error("RPC error to endpoint {0}")]
Rpc(tendermint_rpc::Url),

#[error("call to Tendermint RPC returned error: {0}")]
TendermintRpcError(TendermintRpcError),
Member


Can we leave out the Error suffix in the variant names, here and elsewhere?

Suggested change
TendermintRpcError(TendermintRpcError),
TendermintRpc(TendermintRpcError),

Context::new(err, None).into()
}

pub fn send_error<T>(err: SendError<T>) -> Error
Member


Can we remove the _error suffix from these methods and move them into the Kind impl? That way the suffix becomes redundant, and we can import them qualified by the Kind type.

Contributor Author

@soareschen commented May 18, 2021


Actually, I want to avoid that, because moving them into the impl block makes all of them implicitly imported. With explicit imports, we can just look at the import list and know which errors can be raised in a particular function.

Ideally the error domain of a function should exactly match the enumeration in the error type; however, that is not the case right now in our code base. For example, the function in ChainRuntime only raises a SendError, but the type signature indicates that it can raise arbitrary errors. We should probably refactor that in the future, but for now the explicit import list serves as a good aid for static inspection.

use std::time::Duration;

pub use retry::{
delay::{Fibonacci, Fixed},
retry_with_index, OperationResult as RetryResult,
};

/// When encountering an error, indicates whether the relayer should
/// perform retry on the same operation.
pub trait RetryableError {
Member


👍

.into_inner()
.account
.unwrap()
.value
.as_slice(),
)
.map_err(|e| Kind::Grpc.context(e))?;
.map_err(|e| protobuf_decode_error::<BaseAccount>(e))?;
Member


There's a clippy warning to fix here.

@hu55a1n1
Member

Just my two cents - I love the RetryableError trait approach that @soareschen describes here, but there is another pattern that I would like to point to. This pattern is described in detail here -> Error Handling in a Correctness-Critical Rust Project.

Quoting the post, the idea is to avoid a single (overloaded) Error enum and make 'unhandled errors unrepresentable' ->

make the global Error enum specifically only hold errors that should cause the overall system to halt - reserved for situations that require human intervention. Keep errors that relate to separate concerns in totally separate error types. By keeping errors that must be handled separately in their own types, we reduce the chance that the try ? operator will accidentally push a local concern into a caller that can’t deal with it. If you make such errors representable, they will happen.

In essence, this would mean that we return nested Result types where the outer Result represents system-wide errors that can be handled (short-circuit) with a try ? operator while the inner Result represents more specific internal/recoverable errors. So, this is ultimately just another way of modelling recoverable errors. This approach also helps with maintainability as described in detail in the post itself.
eg. copied from the linked post ->

fn compare_and_swap(
  &mut self,
  key: Key,
  old_value: Value,
  new_value: Value
) -> Result<Result<(), CompareAndSwapError>, sled::Error> {
	// ...
}

Usage ->

// we can actually use try `?` now
let cas_result = sled.compare_and_swap(
	"dogs",
	"pickles",
	"catfood"
)?;

if let Err(cas_error) = cas_result {
	  // handle expected issue
}

I would also caution against using deeply nested error enums because destructuring them can be painful.

@soareschen
Contributor Author

I think what the article describes can be implemented as a nested error enum like:

enum RecoverableError<E1, E2> {
    Recoverable(E1),
    Unrecoverable(E2),
}

We can then just have RecoverableError implement RetryableError as an intermediate measure, with the goal that eventually all other error types migrate to use that.
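As a sketch of that intermediate measure (again assuming an is_retryable method on RetryableError):

impl<E1, E2> RetryableError for RecoverableError<E1, E2> {
    fn is_retryable(&self) -> bool {
        // Only the recoverable arm warrants a retry.
        matches!(self, RecoverableError::Recoverable(_))
    }
}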

However, I think the issue we are having right now is that we have not yet properly classified which errors are recoverable and which are not, and there are too many of them. Once we have done that, it should just be a matter of choosing the right design pattern to reduce boilerplate and encourage good practices.

There are certainly better ways to handle errors than what we are doing right now in the code base; we just need to find ways to improve on it incrementally. That being said, IMO we are on the right track by handling errors in Rust using explicit Result. Error handling is in fact an incredibly complex and opinionated topic. In my experience, even with advanced abstractions like algebraic effects, smart people keep ending up implementing the wrong error abstractions, such as the exception effect, which reinvents the pitfalls of unchecked exceptions in Java.

@romac
Member

romac commented May 18, 2021

@hu55a1n1 That's a very good point! I had somehow forgotten about that post, thanks for bringing it up :)

@soareschen Agreed, I think we may first want to classify the errors using the RetryableError trait, and only after that move on to either a nested Result or the RecoverableError enum to statically ensure we deal with recoverable errors, but that's something for another PR.

@soareschen
Contributor Author

soareschen commented May 18, 2021

I'm trying out another pattern that we could use, which is to nest the Kind from the same crate, while at the same time propagating the Error using anomaly::Context. For example:

    #[error("Key ring error: {0}")]
    KeyRing(crate::keyring::errors::Kind),

    #[error("ICS 018 error: {0}")]
    Ics18(ibc::ics18_relayer::error::Kind),

I was originally forced to do this because for some reason anomaly::Error does not implement Clone, and so it cannot be nested. But now that I think of it, by nesting Kind and type-erasing Error, we kind of get the best of both worlds: very fine-grained error information as well as detailed stack traces.
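A rough sketch of a constructor following this pattern (assuming Kind implements Clone and anomaly::Error exposes a kind() accessor; the names are illustrative):

// Nest the lower-level Kind for classification, while the full lower-level
// Error (carrying its own backtrace) is type-erased into the context.
pub fn key_ring_error(err: crate::keyring::errors::Error) -> Error {
    let kind = Kind::KeyRing(err.kind().clone());
    Context::new(kind, Some(err.into())).into()
}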

If you look at the diff, you can also see that the static types have helped me catch a number of old type-erasing mistakes that caused errors to be misclassified. We sometimes even blindly cast a relayer::error::Error into another relayer::error::Error, just with a different sub-error!

@brapse
Contributor

brapse commented May 19, 2021

Love the conversation here; thanks @soareschen for leading the charge, and thanks for the fantastic insight from @romac and @hu55a1n1. Given how IBC liveness depends on correct error handling, I'm glad to see we are making progress towards the right abstractions.

Just from a scoping perspective, as I assume much of this will span multiple PRs. Can we clarify the acceptance criteria for this PR by enumerating the specific IO operations we want to retry with the new static error types we've developed above?

@soareschen
Contributor Author

Just from a scoping perspective, as I assume much of this will span multiple PRs.

I think the goal of this PR is to gather consensus first on how we want to approach solving the issue. Once we agree on the approach, we can merge this PR and create subsequent PRs that refactor the code following the same approach.

Can we clarify the acceptance criteria for this PR by enumerating the specific IO operations we want to retry with the new static error types we've developed above?

Yes, I think the scoping is not clear enough on what needs to be done. We only know that we want to retry based on what kind of errors are returned, but we have not yet decided which errors should be retried and which should not. Part of the effort is to determine what errors are being raised in the relayer code and what each error really means, before we can decide on the retry logic. Once we have decided on that, it should be a matter of updating the implementations of RetryableError to reflect which errors we want to retry on.

@soareschen
Contributor Author

@ancazamfir based on yesterday's discussion, I have added an sdk_error module (exact path yet to be decided) that maps the SDK errors in ibc-go to concrete error types in Rust. Right now I have only mapped the errors defined in ICS02-Client. Let me know if this is the right approach before I proceed with the other mappings.

I have also modified tx_result_to_event in cosmos.rs to return the SDK errors, instead of returning Ok on the error responses. I am not sure if this breaks any code, so please double check.
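For context, the mapping is roughly of this shape (the codespace string, codes, and variant names below are placeholders, not the actual ibc-go values):

// Placeholder sketch of the sdk_error idea: translate the (codespace, code)
// pair from a deliver_tx response into a concrete Rust error.
#[derive(Clone, Debug, thiserror::Error)]
pub enum SdkError {
    #[error("ICS02 client error (code {0})")]
    Client(u32),

    #[error("unknown SDK error: codespace={codespace}, code={code}")]
    Unknown { codespace: String, code: u32 },
}

pub fn sdk_error_from_code(codespace: &str, code: u32) -> SdkError {
    match codespace {
        "client" => SdkError::Client(code),
        _ => SdkError::Unknown {
            codespace: codespace.to_string(),
            code,
        },
    }
}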

@ancazamfir
Collaborator

I have also modified tx_result_to_event in cosmos.rs to return the SDK errors, instead of returning Ok on the error responses. I am not sure if this breaks any code, so please double check.

Hi @soareschen I will take a look today, sorry for the delay. Will test to see if anything breaks and also will try to integrate with your changes in a couple of places.

Add RetryableError trait

Address review feedback

More error refinement

Add tx_commit_error and abci_query_error

Implement initial mapping of error code from ibc-go into concrete errors

Fix formatting

Fix rebase
@adizere
Member

adizere commented Jun 9, 2021

Closing this PR with follow-up:

  • open a new issue that is specific to a localized error, with the details below:
  • attempt to cause an error inside verify
    • the update client should fail, with either an Io/Rpc error (retriable) or a client expiration error (not retriable)
  • the verify method should distinguish between retriable errors (should be retried) and non-retriable ones (return)
    • propagate the error up to the CLI level, and retry there
