object_store: retry on response decoding errors #6519

erratic-pattern · 2024-10-06T20:56:24Z

Closes apache#6287.

This PR adds `reqwest::Error::Decode` as an error case to retry on. This error can occur when a server drops a connection in the middle of sending the response body.

tustvold · 2024-10-06T21:34:25Z

Have you tested this? I ask because the retry logic occurs before the response body processing.

erratic-pattern · 2024-10-07T02:24:11Z

> Have you tested this? I ask because the retry logic occurs before the response body processing.

I haven't. My assumption was that this error originates from within the reqwest client, so we should be able to catch it here. Even in the streaming case I thought we should still be polling data from the RetryClient, but I will look more closely at the dataflow here to see what I've missed.

Testing this is a bit annoying without mocking the inner client, which isn't exactly a real world scenario. I am not very familiar with the existing testing harness so if you have any recommendations on where to start I would appreciate it.

tustvold · 2024-10-07T09:30:13Z

> I thought we should still be polling data from the RetryClient

Unfortunately, more broadly speaking this is not generally possible, as discussed on the ticket. Once response streaming has started, a retry would need to somehow resume from where it left off, the semantics of which will depend on the method in question. I do not know of a good way to handle this.

> Testing this is a bit annoying without mocking the inner client, which isn't exactly a real world scenario. I am not very familiar with the existing testing harness so if you have any recommendations on where to start I would appreciate it.

We already have a mock HTTP server harness for running these sorts of tests

erratic-pattern · 2024-10-07T13:05:13Z

> Unfortunately, more broadly speaking this is not generally possible, as discussed on the ticket.

Could you link me to where this is discussed? I'm afraid there have been a lot of comments about this spread across various issues, so it is hard to find specific discussions.

> Once response streaming has started, a retry would need to somehow resume from where it left off, the semantics of which will depend on the method in question. I do not know of a good way to handle this.

Is "method" in this context referring to the HTTP method?

Perhaps we need a RetryStream to wrap the response stream in? I am not sure how that would hook up to the existing trait methods exactly, but it seems necessary if we want to transparently re-initiate response streaming.

#6287 suggests having a manual way to re-initiate the request, but I'm not sure what that would look like either.

tustvold · 2024-10-07T13:27:17Z

> Could you link me to where this is discussed?

#6287 (comment)

> Perhaps we need a RetryStream to wrap the response stream in? I am not sure how that would hook up to the existing trait methods exactly, but it seems necessary if we want to transparently re-initiate response streaming.

I think we could add a method to RetryableRequest to return a Result<Bytes> that can be used by non-idempotent, non-streaming requests, such as ObjectStore::list, and which will retry errors during response streaming by retrying the entire request.
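To make the idea concrete, here is a minimal synchronous sketch of retrying the whole request when buffering the body fails. `FetchError` and `fetch_buffered_with_retry` are hypothetical names used only for illustration; they are not part of object_store or of `RetryableRequest`. Real code would presumably be async, return `Bytes`, and apply backoff between attempts.

```rust
/// Hypothetical error type for this sketch; not an object_store type.
#[derive(Debug, PartialEq)]
enum FetchError {
    /// The connection dropped while the response body was being read.
    BodyInterrupted,
    /// An error that retrying will not fix, e.g. a permission failure.
    Fatal(String),
}

/// Calls `attempt` (which sends the request AND buffers the full body)
/// until it succeeds, returns a non-retryable error, or has been retried
/// `max_retries` times. Because the body is only handed to the caller once
/// it is complete, a mid-body failure can be retried from scratch.
fn fetch_buffered_with_retry<F>(
    max_retries: usize,
    mut attempt: F,
) -> Result<Vec<u8>, FetchError>
where
    F: FnMut() -> Result<Vec<u8>, FetchError>,
{
    let mut retries = 0;
    loop {
        match attempt() {
            Ok(body) => return Ok(body),
            // Only interrupted bodies are retried, and only while budget remains.
            Err(FetchError::BodyInterrupted) if retries < max_retries => {
                retries += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

This only works because nothing is streamed to the caller before the body is complete, which is why it would not carry over directly to `ObjectStore::get`.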

However, ObjectStore::get will require retrying at a higher level, as it will need not only to keep track of the current offset, but also to compute a new range for the retry and re-sign the resulting request. Perhaps something in GetClientExt might work 🤔
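The offset bookkeeping described here could be sketched roughly as follows. `resume_range` and `range_header` are hypothetical helpers, not existing object_store functions, and the sketch deliberately leaves out re-signing the request and checking that the object has not changed between attempts.

```rust
use std::ops::Range;

/// Given the byte range originally requested and the number of bytes that
/// were already delivered to the caller, returns the range to request on a
/// retry, or `None` if nothing is left to fetch.
fn resume_range(requested: &Range<u64>, delivered: u64) -> Option<Range<u64>> {
    let start = requested.start.saturating_add(delivered);
    if start >= requested.end {
        None
    } else {
        Some(start..requested.end)
    }
}

/// Formats a non-empty half-open range as an HTTP `Range` header value.
/// HTTP byte ranges use an inclusive end, hence the `- 1`.
fn range_header(range: &Range<u64>) -> String {
    format!("bytes={}-{}", range.start, range.end - 1)
}
```

For example, if a request for bytes `0..100` fails after 40 bytes have been yielded, the retry would ask for `40..100`, which corresponds to the header value `bytes=40-99`.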

alamb · 2024-10-08T16:24:15Z

In my opinion, to move forward we really need an example/test showing the problem so we can evaluate how the proposed solution fixes it. More discussion here: #6287 (comment)
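As a starting point for such a reproduction, the failure mode can be simulated with the standard library alone: a server that advertises a `Content-Length` but closes the connection after sending only part of the body. This is a sketch under assumptions, not the project's existing mock HTTP server harness; `spawn_truncating_server` and `fetch_body_len` are hypothetical names, and the raw-socket client stands in for a real HTTP client, which would be expected to report the short body as an error.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;

/// Starts a one-shot server that promises `promised` body bytes but sends
/// only `sent` of them before closing the connection.
fn spawn_truncating_server(promised: usize, sent: usize) -> SocketAddr {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let addr = listener.local_addr().expect("local_addr");
    thread::spawn(move || {
        if let Ok((mut stream, _)) = listener.accept() {
            // Drain the request head so the close is a clean FIN.
            let mut buf = [0u8; 1024];
            let mut head = Vec::new();
            while !head.windows(4).any(|w| w == b"\r\n\r\n") {
                match stream.read(&mut buf) {
                    Ok(0) | Err(_) => break,
                    Ok(n) => head.extend_from_slice(&buf[..n]),
                }
            }
            let header = format!(
                "HTTP/1.1 200 OK\r\nContent-Length: {}\r\nConnection: close\r\n\r\n",
                promised
            );
            let _ = stream.write_all(header.as_bytes());
            let _ = stream.write_all(&vec![b'x'; sent]);
            // Dropping `stream` here closes the connection mid-body.
        }
    });
    addr
}

/// Sends a GET request and returns `(promised, received)` body lengths.
fn fetch_body_len(addr: SocketAddr) -> io::Result<(usize, usize)> {
    let mut stream = TcpStream::connect(addr)?;
    stream.write_all(b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n")?;
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;
    let split = raw
        .windows(4)
        .position(|w| w == b"\r\n\r\n")
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "no header terminator"))?;
    let head = String::from_utf8_lossy(&raw[..split]);
    let promised = head
        .lines()
        .find_map(|line| line.strip_prefix("Content-Length: "))
        .and_then(|v| v.trim().parse::<usize>().ok())
        .unwrap_or(0);
    Ok((promised, raw.len() - split - 4))
}
```

A test built on this would point the client under test at the returned address and assert on how the truncated body is reported, and whether any retry is attempted.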

object_store: retry on response decoding errors

ffa3a4b


github-actions bot added the object-store Object Store Interface label Oct 6, 2024

This was referenced Oct 6, 2024

object_store: Retry on connection duration timeouts? #6287

Open

error decoding response body after upgrade to object store 0.10 #5882

Open

erratic-pattern marked this pull request as draft October 7, 2024 12:43
