Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

object_store: retry on response decoding errors #6519

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

erratic-pattern
Copy link
Contributor

Closes #6287

This PR includes reqwest::Error::Decode as an error case to retry on, which can occur when a server drops a connection in the middle of sending the response body.

Closes apache#6287

This PR includes `reqwest::Error::Decode` as an error case to retry on,
which can occur when a server drops a connection in the middle of
sending the response body.
@tustvold
Copy link
Contributor

tustvold commented Oct 6, 2024

Have you tested this, I ask as the retry logic occurs before the response body processing

@erratic-pattern
Copy link
Contributor Author

Have you tested this, I ask as the retry logic occurs before the response body processing

I haven't. My assumption was that this error originates from within the reqwest client, so we should be able to catch it here. Even in the streaming case I thought we should still be polling data from the RetryClient, but I will look more closely at the dataflow here to see what I've missed.

Testing this is a bit annoying without mocking the inner client, which isn't exactly a real world scenario. I am not very familiar with the existing testing harness so if you have any recommendations on where to start I would appreciate it.

@tustvold
Copy link
Contributor

tustvold commented Oct 7, 2024

I thought we should still be polling data from the RetryClient

Unfortunately, more broadly speaking this is not generally possible, as discussed on the ticket. Once response streaming has started, a retry would need to somehow resume from where it left off, the semantics of which will depend on the method in question. I do not know of a good way to handle this.

Testing this is a bit annoying without mocking the inner client, which isn't exactly a real world scenario. I am not very familiar with the existing testing harness so if you have any recommendations on where to start I would appreciate it.

We already have a mock HTTP server harness for running these sorts of tests

@erratic-pattern erratic-pattern marked this pull request as draft October 7, 2024 12:43
@erratic-pattern
Copy link
Contributor Author

Unfortunately, more broadly speaking this is not generally possible, as discussed on the ticket.

Could you link me to where this is discussed? I'm afraid there's been a lot of comments around this spread across various issues, so it is hard to find specific discussions.

Once response streaming has started, a retry would need to somehow resume from where it left off, the semantics of which will depend on the method in question. I do not know of a good way to handle this.

Is "method" in this context referring to the HTTP method?

Perhaps we need a RetryStream to wrap the response stream in? I am not sure how that would hook up to the existing trait methods exactly, but it seems necessary if we want to transparently re-initiate response streaming.

#6287 suggests having a manual way to re-initiate the request, but I'm not sure what that would look like either.

@tustvold
Copy link
Contributor

tustvold commented Oct 7, 2024

Could you link me to where this is discussed?

#6287 (comment)

Perhaps we need a RetryStream to wrap the response stream in? I am not sure how that would hook up to the existing trait methods exactly, but it seems necessary if we want to transparently re-initiate response streaming.

I think we could add a method to RetryableRequest to return a Result<Bytes> that can be used by non-idempotent, non-streaming requests, such as ObjectStore::list, and which will retry errors during response streaming by retrying the entire request.

However, ObjectStore::get will require retrying at a higher level, as not only will it need to keep track of the current offset, but compute a new range for the retry, and re-sign the resulting request. Perhaps something in GetClientExt might work 🤔

@alamb
Copy link
Contributor

alamb commented Oct 8, 2024

In my opinion, to move forward we really need an example/test showing the problem so we can evaluate how the proposed solution fixes it. More discussion here: #6287 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
object-store Object Store Interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

object_store: Retry on connection duration timeouts?
3 participants