This repository has been archived by the owner on Apr 26, 2024. It is now read-only.
We run synapse with workers in Kubernetes, where the event persister workers are deployed as a StatefulSet. Events must be persisted by a specific worker, i.e. having two event persisters does not enable high availability.

This means that during an update there is a small (~60s) window during which replication HTTP requests to a worker that is currently restarting will fail, either with a DNS resolution failure (Kubernetes has not yet created the DNS record for the new pod) or with some kind of could-not-connect failure because the process is not yet listening for traffic (need to confirm the exact exception here).

There is already logic to retry on timeouts, which is the default for all HTTP replication requests, which implies that retrying these requests is a safe operation.

Therefore I think it'd be great to retry in the cases above as well, probably with exponential backoff and an overall time limit so requests don't hang indefinitely or time out on the client. I'll have a go at implementing a PoC roughly along these lines.
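To make the idea concrete, here is a minimal sketch of the kind of retry wrapper I have in mind. This is not Synapse code: `send_replication_request` is a hypothetical stand-in for whatever coroutine actually issues the replication HTTP request, and the exception types, backoff schedule, and overall deadline are placeholder assumptions.

```python
import asyncio
import random


async def send_replication_request() -> dict:
    """Hypothetical stand-in for the coroutine that performs the actual
    replication HTTP request to the target worker."""
    raise NotImplementedError


async def send_with_retries(
    request=send_replication_request,
    max_elapsed: float = 60.0,   # assumed overall deadline (~one pod restart window)
    initial_delay: float = 1.0,  # assumed first backoff interval
    max_delay: float = 10.0,     # assumed cap on any single backoff interval
):
    """Retry a replication request on DNS/connect failures with exponential
    backoff, giving up once ``max_elapsed`` seconds have passed overall."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_elapsed
    delay = initial_delay
    while True:
        try:
            return await request()
        except OSError:  # DNS resolution and connection-refused failures are OSError subclasses
            remaining = deadline - loop.time()
            if remaining <= 0:
                raise  # out of time: surface the last failure to the caller
            # Sleep for a jittered backoff interval, but never past the deadline.
            await asyncio.sleep(min(remaining, delay * random.uniform(0.8, 1.2)))
            delay = min(delay * 2, max_delay)
```

Retrying only on connection-level failures keeps this consistent with the existing retry-on-timeout behaviour, which already assumes these requests are safe to repeat.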
Fizzadar added a commit to Fizzadar/synapse that referenced this issue on Apr 25, 2022:

This allows for the target process to be down for around a minute, which provides time for restarts during synapse upgrades/config updates.
Closes: matrix-org#12178
Signed off by Nick Mills-Barrett <nick@beeper.com>
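As a rough sanity check on the "around a minute" figure, an exponential schedule with a per-retry cap covers a ~60s restart window in a handful of attempts. The parameters below are illustrative assumptions, not the values used in the commit:

```python
# Assumed schedule: 1s, 2s, 4s, 8s, then capped at 10s per retry.
delay, total, attempts = 1.0, 0.0, 0
while total < 60.0:
    total += delay
    attempts += 1
    delay = min(delay * 2, 10.0)
print(attempts, total)  # 9 attempts, 65.0s of cumulative backoff
```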