Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

retry federation requests on 5xx errors #8915

Closed
richvdh opened this issue Dec 10, 2020 · 8 comments · Fixed by #9567
Closed

retry federation requests on 5xx errors #8915

richvdh opened this issue Dec 10, 2020 · 8 comments · Fixed by #9567
Labels
good first issue Good for newcomers z-bug (Deprecated Label) z-p2 (Deprecated Label)

Comments

@richvdh
Copy link
Member

richvdh commented Dec 10, 2020

federationhttpclient has a bunch of logic to retry requests, but it only fires for connection failures, 429 responses, and (optionally) dns lookup errors. Why don't we retry on 5xx errors?

(For example: we sometimes see 520 errors from matrix.org. Arguably that's indicative of a problem, but I think we could be robust to it.)

@erikjohnston
Copy link
Member

I think this is more of a thinko that we should just fix. I think we should just extend:

# Retry if the error is a 429 (Too Many Requests),
# otherwise just raise a standard HttpResponseException
if response.code == 429:
raise RequestSendFailed(exc, can_retry=True) from exc
else:
raise exc

Though I wonder if we should not retry quite so often on 500 errors, as they should(?) be more likely due to bad request, but that isn't always true.

@richvdh
Copy link
Member Author

richvdh commented Dec 15, 2020

Though I wonder if we should not retry quite so often on 500 errors, as they should(?) be more likely due to bad request

a bad request should cause a 400 error :-p

@richvdh
Copy link
Member Author

richvdh commented Dec 15, 2020

In fact I would say the very thing that distinguishes a 500 from a 400 is that a 500 says "please try the same thing again, you might have more luck" (whereas a 400 says "stop doing that please")

@erikjohnston
Copy link
Member

Yeah, that's fair

@y-pankaj
Copy link
Contributor

Hi! I would like to take up this issue.
As mentioned before by @erikjohnston, should I just extend the if statement and change the comment as well?

@turt2live
Copy link
Member

503 backing off for 10 minutes is also a bit harsh:

[federation_sender_1] 2021-01-26 22:33:41,773 - synapse.http.matrixfederationclient - 217 - INFO - federation_transaction_transmission_loop-93440- {PUT-O-78425} [fosdem.org] Completed request: 200 OK in 0.07 secs - PUT matrix://fosdem.org/_matrix/federation/v1/send/1611673272139
[federation_sender_1] 2021-01-26 22:33:41,773 - synapse.federation.sender.transaction_manager - 163 - INFO - federation_transaction_transmission_loop-93440- TX [fosdem.org] {1611673272139} got 200 response
[federation_sender_1] 2021-01-26 22:33:45,254 - synapse.federation.sender.transaction_manager - 127 - INFO - federation_transaction_transmission_loop-93445- TX [fosdem.org] {1611673272144} Sending transaction [1611673272144], (PDUs: 0, EDUs: 1)
[federation_sender_1] 2021-01-26 22:33:45,281 - synapse.http.matrixfederationclient - 217 - INFO - federation_transaction_transmission_loop-93445- {PUT-O-78430} [fosdem.org] Completed request: 200 OK in 0.03 secs - PUT matrix://fosdem.org/_matrix/federation/v1/send/1611673272144
[federation_sender_1] 2021-01-26 22:33:45,282 - synapse.federation.sender.transaction_manager - 163 - INFO - federation_transaction_transmission_loop-93445- TX [fosdem.org] {1611673272144} got 200 response
[federation_sender_1] 2021-01-26 22:33:55,234 - synapse.federation.sender.transaction_manager - 127 - INFO - federation_transaction_transmission_loop-93449- TX [fosdem.org] {1611673272148} Sending transaction [1611673272148], (PDUs: 0, EDUs: 1)
[federation_sender_1] 2021-01-26 22:33:55,248 - synapse.http.matrixfederationclient - 508 - INFO - federation_transaction_transmission_loop-93449- {PUT-O-78434} [fosdem.org] Got response headers: 503 Service Temporarily Unavailable
[federation_sender_1] 2021-01-26 22:33:55,249 - synapse.http.matrixfederationclient - 586 - WARNING - federation_transaction_transmission_loop-93449- {PUT-O-78434} [fosdem.org] Request failed: PUT matrix://fosdem.org/_matrix/federation/v1/send/1611673272148: HttpResponseException('503: Service Temporarily Unavailable',)
[federation_sender_1] 2021-01-26 22:33:55,249 - synapse.util.retryutils - 215 - INFO - federation_transaction_transmission_loop-93449- Connection to fosdem.org was unsuccessful (<class 'synapse.api.errors.HttpResponseException'>(503: Service Temporarily Unavailable)); backoff now 600000
[federation_sender_1] 2021-01-26 22:33:55,249 - synapse.federation.sender.transaction_manager - 159 - INFO - federation_transaction_transmission_loop-93449- TX [fosdem.org] {1611673272148} got 503 response
[federation_sender_1] 2021-01-26 22:33:55,250 - synapse.federation.sender.per_destination_queue - 411 - WARNING - federation_transaction_transmission_loop-93449- TX [fosdem.org] Received 503 response to transaction: 503: Service Temporarily Unavailable

My server was happily talking to fosdem.org for a long while, but then fosdem.org was restarted which lead to a single 503 - the backoff is a bit harsh here.

@ShadowJonathan
Copy link
Contributor

Why is the backoff a flat 10 minutes, anyways? Why not exponential backoff - from 10 seconds - that caps at around 15 minutes?

@richvdh
Copy link
Member Author

richvdh commented Feb 9, 2021

it is exponential. It starts at 10 minutes because it assumes (incorrectly, per this issue) that by that point you've already tried a few times. See also #5406 (comment) for somewhere I've written about this in the past.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
good first issue Good for newcomers z-bug (Deprecated Label) z-p2 (Deprecated Label)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants