Fine grained error handling #950
Conversation
Great start! I forgot to mention this during our last chat: can you please have a look over issue #68, especially the comments? We also put some initial effort into fixing the dynamic errors problem in PR #493, which we abandoned because it was not a priority and was too disruptive. I think you captured most of the considerations from that issue and the abandoned PR, but looking through them might still provide a bit more context on this topic.
Note that, as part of the effort towards …
Thanks for the pointer! Yes, I agree with some of the arguments made in #68, in particular that we should prefer static errors over dynamic errors wrapped in …
Good to know that. I think it's fine to use … for now.

Also, moving entirely away from anomaly is probably going to take a while. Do you think it is better to first classify the errors in multiple smaller PRs, or should we refactor everything in one go?
It's not yet clear to me how to recover backtraces in the … As for the … What do you think?
By this, do you mean that every function in …?
Thanks @soareschen! Overall I really like these changes, especially the `RetryableError` trait. I left a few suggestions for dropping some suffixes and moving some methods into the `Kind` impl.

Now that I think of it, perhaps we could move them to an `impl Error {}` block, where `Error` is `type Error = anomaly::Error<Kind>`, instead of within `Kind` itself?
relayer/src/error.rs
Outdated
```diff
@@ -30,6 +37,9 @@ pub enum Kind {
     #[error("RPC error to endpoint {0}")]
     Rpc(tendermint_rpc::Url),

+    #[error("call to Tendermint RPC returned error: {0}")]
+    TendermintRpcError(TendermintRpcError),
```
Can we leave out the `Error` suffix in the variant names, here and elsewhere?

```diff
-TendermintRpcError(TendermintRpcError),
+TendermintRpc(TendermintRpcError),
```
```rust
    Context::new(err, None).into()
}

pub fn send_error<T>(err: SendError<T>) -> Error
```
Can we remove the `_error` suffix from these methods and move them into the `Kind` impl? That way the suffix becomes redundant, and we can import the methods qualified by the `Kind` type.
Actually I want to avoid that, because moving them into an `impl` block makes all of them implicitly imported. With explicit imports, we can just look at the import list and know which errors can be raised in a particular function.

Ideally, the error domain of a function should exactly match the variants in the error type; however, that is not the case right now in our code base. For example, the functions in `ChainRuntime` only raise a `SendError`, but the type signature indicates that they can raise arbitrary errors. We should probably refactor that in the future, but for now the explicit import list serves as a good means of static inspection.
```rust
use std::time::Duration;

pub use retry::{
    delay::{Fibonacci, Fixed},
    retry_with_index, OperationResult as RetryResult,
};

/// When encountering an error, indicates whether the relayer should
/// perform retry on the same operation.
pub trait RetryableError {
```
👍
relayer/src/chain/cosmos.rs
Outdated
```diff
         .into_inner()
         .account
         .unwrap()
         .value
         .as_slice(),
     )
-    .map_err(|e| Kind::Grpc.context(e))?;
+    .map_err(|e| protobuf_decode_error::<BaseAccount>(e))?;
```
There's a clippy warning to fix here.
Just my two cents: I love the … Quoting the post, the idea is to avoid a single (overloaded) `Error` enum and make "unhandled errors unrepresentable":
In essence, this would mean that we return nested `Result`s:

```rust
fn compare_and_swap(
    &mut self,
    key: Key,
    old_value: Value,
    new_value: Value
) -> Result<Result<(), CompareAndSwapError>, sled::Error> {
    // ...
}
```

Usage:

```rust
// we can actually use try `?` now
let cas_result = sled.compare_and_swap(
    "dogs",
    "pickles",
    "catfood"
)?;

if let Err(cas_error) = cas_result {
    // handle expected issue
}
```

I would also caution against using deeply nested error enums, because destructuring them can be painful.
I think what the article describes can be implemented as a nested error enum like:

```rust
enum RecoverableError<E1, E2> {
    Recoverable(E1),
    Unrecoverable(E2),
}
```

We can then just have … However, I think the issue we are having right now is that we have not yet properly classified which errors are recoverable and which are not, and there are too many of them. Once we have done that, it should just be a matter of choosing the right design pattern to reduce boilerplate and encourage good practices. There could certainly be better ways to handle errors than what we are doing right now in the code base; we just need to find ways to improve it incrementally.

That being said, IMO we are on the right track of handling errors in Rust using explicit …
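As a small, self-contained sketch of consuming such a nested enum (the leaf error types here are hypothetical, not from the code base), a caller only needs to match on the top-level variant:

```rust
// Illustrative nested enum distinguishing recoverable from unrecoverable causes.
#[derive(Debug)]
enum RecoverableError<E1, E2> {
    Recoverable(E1),
    Unrecoverable(E2),
}

// Hypothetical leaf error types for the example.
#[derive(Debug)]
struct Timeout;
#[derive(Debug)]
struct ConfigBroken;

// A retry loop only needs to inspect the top-level variant,
// without destructuring the nested payload.
fn should_retry(err: &RecoverableError<Timeout, ConfigBroken>) -> bool {
    matches!(err, RecoverableError::Recoverable(_))
}
```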
@hu55a1n1 That's a very good point! I had somehow forgotten about that post, thanks for bringing it up :)

@soareschen Agreed, I think we may first want to classify the errors using the …
I'm trying out another new pattern that we could use, which is to nest on …:

```rust
#[error("Key ring error: {0}")]
KeyRing(crate::keyring::errors::Kind),

#[error("ICS 018 error: {0}")]
Ics18(ibc::ics18_relayer::error::Kind),
```

I was originally forced to do this because for some reason … If you look at the diff, you can also see that the static types have helped me catch a number of old type-erasing errors that caused misclassification of errors. We sometimes even blindly cast the same …
Love the conversation here. Thanks @soareschen for leading the charge, and thanks to @romac and @hu55a1n1 for the fantastic insight. Given how IBC liveness depends on correct error handling, I'm glad to see we are making progress towards the right abstractions.

Just from a scoping perspective, as I assume much of this spans multiple PRs: can we clarify the acceptance criteria for this PR by enumerating the specific IO operations we want to retry with the new static error types developed above?
I think the goal of this PR is to gather consensus first on how we want to approach solving the issue. Once we agree on the approach, we can merge this PR and create subsequent PRs that refactor the code following the same approach.
Yes, I think the scoping is not clear enough on what needs to be done. We only know that we want to retry based on what kind of error is returned, but we have not yet decided which errors should be retried and which should not. Part of the effort is to determine which errors are raised in the relayer code and what each error really means, before we can decide on the retry logic. Once we have decided on that, it should be a matter of updating the implementations of …
@ancazamfir based on yesterday's discussion, I have added an … I have also modified …
Hi @soareschen, I will take a look today, sorry for the delay. I will test to see if anything breaks and will also try to integrate your changes in a couple of places.
- Add RetryableError trait
- Address review feedback
- More error refinement
- Add tx_commit_error and abci_query_error
- Implement initial mapping of error code from ibc-go into concrete errors
- Fix formatting
- Fix rebase
Closing this PR with follow-up: …
Partially Fixes: #712
Description
This is an initial refactoring that I have done to try and tame the error handling logic in the relayer. There are some design decisions which I'll try to explain here. We could also create an ADR to describe the design.
Error Constructors

Currently we are using `anomaly` to manage nested errors and backtraces. The abstraction provided by `anomaly` allows us to easily capture backtraces, together with an error context containing the previous dynamic errors that caused an error of interest. While the design of `anomaly` is good enough for its intended purpose, some issues arise from its use.

In particular, the dynamic nature of `anomaly` makes it very easy to raise errors from any source error type. This makes it difficult to statically inspect what kind of underlying errors can cause a particular error to be raised. As an example, the `Grpc` error is raised when any error occurs while making a gRPC call, whether it is caused by connection issues, decoding errors, or application errors.

To fix this, we want to use static types to enforce what kinds of underlying errors can cause the current error to be raised. As an example, we may want to split the `Grpc` error into sub-errors such as `GrpcStatusError`, raised when an invalid response status is returned, and `GrpcTransportError`, raised when there is an error in the underlying transport. Furthermore, we want the type system to enforce that the error causes cannot be mixed, i.e. to prevent the code from accidentally raising a `GrpcStatusError` when the underlying cause is actually a transport error.

To enforce this, we could try to modify the constructor for each sub-error to take the underlying cause as an argument. However, doing so would break the abstraction that `anomaly` gives us, since we would then have to manually expose the error context and backtraces. We could alternatively consider other error handling libraries, or come up with our own error handling abstraction, but either would require significant effort to refactor the entire code base away from `anomaly`.

Instead, a more pragmatic approach is to retrofit our requirement on top of `anomaly`, so that we can slowly migrate to the new error handling approach. In our proposed solution, we continue to use the `Kind` pattern to classify the errors and use `anomaly` to propagate the underlying errors. However, we refactor the code so that all constructors of the `Kind` error are kept private, and expose error constructor functions instead.

The main idea is that with a constructor function, the concrete type of the underlying error must also be provided alongside. Inside the constructor function, we proceed as usual and type-erase the underlying error using `anomaly::Context`. But from the outside, the only way to construct a specific error is by giving a concrete error value.

As an example, in this PR we have the error constructor function `fn grpc_transport_error(err: tonic::transport::Error) -> Error`, which allows the application to raise a `GrpcTransportError` only when a `tonic::transport::Error` happens.
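To illustrate, here is a minimal, self-contained sketch of the constructor-function pattern. The `TransportError` type and the plain `Error(Kind)` wrapper are stand-ins for `tonic::transport::Error` and `anomaly::Error<Kind>`, which are not reproduced here:

```rust
use std::fmt;

// Hypothetical stand-in for a lower-level error such as `tonic::transport::Error`.
#[derive(Debug)]
pub struct TransportError(pub String);

// The error kind enum; it is kept private to the module, so callers
// cannot pick an arbitrary classification for an error they raise.
#[derive(Debug)]
enum Kind {
    GrpcStatus(String),
    GrpcTransport(String),
}

// Public error type wrapping the kind (stand-in for `anomaly::Error<Kind>`).
#[derive(Debug)]
pub struct Error(Kind);

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.0 {
            Kind::GrpcStatus(s) => write!(f, "invalid gRPC status: {}", s),
            Kind::GrpcTransport(s) => write!(f, "gRPC transport error: {}", s),
        }
    }
}

// The only way to build a transport error is from a concrete `TransportError`,
// so a status problem can never be misclassified as a transport problem.
pub fn grpc_transport_error(err: TransportError) -> Error {
    Error(Kind::GrpcTransport(err.0))
}

pub fn grpc_status_error(status: String) -> Error {
    Error(Kind::GrpcStatus(status))
}
```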
RetryableError

Another abstraction I am trying to introduce is the `RetryableError` trait. This trait abstracts away the error details and tells us whether an operation should be retried when a given error happens. By implementing this trait on the error types of interest, we can decouple the retry operations from the logic of determining whether an error should be retried.

Currently, as the possible errors are under-specified, I left a default implementation that treats all sub-errors in an error type as retryable. We can slowly refine the implementation as we come to understand each specific error better.
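A minimal sketch of this decoupling, with an illustrative `RpcError` enum and a hand-rolled retry helper standing in for the `retry` crate (only the `RetryableError` name comes from this PR; everything else is an assumption for illustration):

```rust
// The retry loop only needs this one-method view of an error,
// not the error's full internal structure.
pub trait RetryableError {
    fn is_retryable(&self) -> bool;
}

// Illustrative error type: one transient and one permanent failure mode.
#[derive(Debug)]
pub enum RpcError {
    Timeout,        // transient: worth retrying
    InvalidRequest, // permanent: retrying cannot help
}

impl RetryableError for RpcError {
    fn is_retryable(&self) -> bool {
        matches!(self, RpcError::Timeout)
    }
}

// Generic retry helper, decoupled from any particular error enum.
pub fn retry_with<T, E: RetryableError>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            // Retry only while the error says so and attempts remain.
            Err(e) if e.is_retryable() && attempt + 1 < max_attempts => attempt += 1,
            Err(e) => return Err(e),
        }
    }
}
```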
Case Splitting Underlying Errors
With our proposed approach, all underlying errors are still type-erased by `anomaly` into `BoxError`. However, when implementing `RetryableError`, we may encounter cases where we need to peek into an underlying sub-error to determine whether the error is retryable. In such cases there is a simpler solution: inspect the underlying error during construction, and classify it into different sub-errors up front.

As an example, consider the error constructor `fn tendermint_rpc_error(err: tendermint_rpc::Error) -> Error`. We may want to decide whether the outer error is retryable depending on the error code returned by Tendermint RPC. In that case, we would inspect the error code inside `tendermint_rpc_error`, and construct either of the sub-errors `RecoverableTendermintError` and `UnrecoverableTendermintError`.
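A self-contained sketch of this case splitting, assuming a simplified error shape (the real `tendermint_rpc::Error` and its code classification are not reproduced here; the `< 0` rule below is purely illustrative):

```rust
// Stand-in for `tendermint_rpc::Error`: the real type carries an error code.
#[derive(Debug)]
pub struct RpcError {
    pub code: i32,
    pub message: String,
}

// Simplified stand-in for the error kind; the real constructor would
// wrap this into the outer `Error` type.
#[derive(Debug, PartialEq)]
pub enum Kind {
    RecoverableTendermint(i32),
    UnrecoverableTendermint(i32),
}

// Case-split at construction time: inspect the error code once, here,
// instead of re-parsing a type-erased error inside `RetryableError`.
pub fn tendermint_rpc_error(err: RpcError) -> Kind {
    // Illustrative assumption: negative (internal-style) codes are
    // treated as transient; a real mapping would follow the actual
    // Tendermint RPC error codes.
    if err.code < 0 {
        Kind::RecoverableTendermint(err.code)
    } else {
        Kind::UnrecoverableTendermint(err.code)
    }
}
```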
Different Roles of Error Constructs

In summary, we try to identify the different roles of the error constructs, and see how different abstractions can be used to handle each role separately.
Error Context and Backtraces
This is for debugging purposes when an error is raised, and we want to find out how the error was propagated. This is handled well by `anomaly`.

Error Construction
We want to construct errors such that a higher-level sub-error is caused by a specific lower-level error type. This allows errors to be properly classified, and avoids cases where dynamic errors are misclassified. The application code is only concerned with constructing (throwing) errors and does not care about the internal structure.
The error constructor function allows us to address this issue.
Error Display
When an error is raised, we want to inform the user with as much information as is available. This is handled by defining separate sub-errors in an error enum and using the `#[error()]` attribute from `thiserror` to format the different error messages.

Error Recovery
A higher-level application may want to recover from errors raised by a lower-level application by inspecting the internal enum structure. In this case, each sub-error in the enum may carry additional information on whether an error is recoverable.
In practice, we are not doing much error recovery based on specific sub-errors. As a result, our classification of sub-errors in the error enums is used more for display and informational purposes.
Error Retry
A concept related to error recovery is retrying an operation when errors are raised. This is different from the error recovery mechanism: here the application is only concerned with whether it should retry the operation, whereas error recovery typically needs to understand more details about the internal organization of the lower-level errors. In this PR, we introduce the `RetryableError` trait so that it encapsulates the information on whether a sub-error is retryable.