-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Handle race between persisting an event and un-partial stating a room #13100
Handle race between persisting an event and un-partial stating a room #13100
Conversation
Catch `IntegrityError`s instead of `DatabaseError`s and downgrade the log message to INFO.
Define a `PartialStateConflictError` exception, to be raised when persisting a partial state event into an un-partial stated room. Signed-off-by: Sean Quah <seanq@matrix.org>
…vent in an un-partial stated room Raise a `PartialStateConflictError` in `PersistEventsStore.store_event_state_mappings_txn` when we try to persist a partial state event in an un-partial stated room. Also document the exception in the docstrings for `PersistEventsStore._persist_events_and_state_updates`, `_persist_events_txn` and `_update_outliers_txn`. Signed-off-by: Sean Quah <seanq@matrix.org>
…roller` Update the docstrings for `persist_event`, `persist_events` and `_persist_event_batch`. Signed-off-by: Sean Quah <seanq@matrix.org>
…_and_notify_client_event` The `PartialStateConflictError` comes from the call to `EventsPersistenceStorageController.persist_event` in the middle of the method. Signed-off-by: Sean Quah <seanq@matrix.org>
Signed-off-by: Sean Quah <seanq@matrix.org>
Instead of sprinkling retries for client events all over the place, raise a 503 Service Unavailable in `EventCreationHandler.handle_new_client_event`, which all client events go through. A 503 is usually temporary and it is hoped that clients will retry whatever they are doing. Signed-off-by: Sean Quah <seanq@matrix.org>
…tion Signed-off-by: Sean Quah <seanq@matrix.org>
…ess_remote_join` Convert `PartialStateConflictError`s to 503s when processing remote joins. We make no attempt to handle the error correctly, since it can only occur on additional joins into partial state rooms, which isn't supported yet. Signed-off-by: Sean Quah <seanq@matrix.org>
…push_actions_and_persist_event` The `PartialStateConflictError` comes from the call to `persist_events_and_notify` near the end. Signed-off-by: Sean Quah <seanq@matrix.org>
Signed-off-by: Sean Quah <seanq@matrix.org>
Retry `_process_received_pdu` on `PartialStateConflictError` in `FederationEventHandler.on_receive_pdu`. Document `PartialStateConflictError` in the docstring for `FederationEventHandler._processed_received_pdu`. The exception can come from the call to `FederationEventHandler._run_push_actions_and_persist_event`. We ignore `FederationEventHandler._process_pulled_event`, because those events should not be persisted with partial state. Signed-off-by: Sean Quah <seanq@matrix.org>
ed23f1d
to
8af0caa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This basically looks good to me.
That said:
- I'm not sure what your question is on about — sorry :/
- I'm not a fan giving 503s to clients because they hit a race (but I could be convinced that this can be fixed with a later PR; in that case, maybe it's just time to add a TODO and open an issue?)
This error should not be exposed to clients. | ||
""" | ||
|
||
def __init__(self) -> None: | ||
super().__init__( | ||
HTTPStatus.CONFLICT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This being a SynapseError
(one with an HTTP status code) if we shouldn't expose it to clients confused me.
I think you're doing this for replication reasons; maybe it would be worth noting that in the docstring to explain why it has a special HTTP response code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll update the docstring.
synapse/handlers/federation.py
Outdated
"Room %s was un-partial stated while processing remote join.", | ||
room_id, | ||
) | ||
raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC this is a bad response code to return because it makes e.g. CloudFlare start to treat us as being down.
I also am not the biggest fan of requiring the client to retry it personally. I think I'd like to see it automatically retried, but appreciate that will be a pain. :/
Maybe the right thing to do is accept this for now, but TODO it/open issue to get around to retrying this in a separate PR for readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a more appropriate status code that we can use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still undecided about what to do about the client paths. There's just so many of them :/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone for 429 Too Many Requests with a retry_after_ms
of 0
, which has the closest semantics to what we want and ought to avoid any reverse proxy weirdness.
I think it's okay to expect clients to retry requests. They have to have retry logic anyway, for robustness in the face of bad connectivity. I wouldn't expect this extra case to add more complexity to clients.
"Room %s was un-partial stated during `on_send_membership_event`, trying again.", | ||
room_id, | ||
) | ||
return await self._federation_event_handler.on_send_membership_event( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so to clarify, there should only ever be one retry needed because once it's un-partial-stated, it can't conflict anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, a room can only be un-partial-stated once.
Unless we leave it or purge it, but I don't know what happens in that case, even in the absence of faster room joins.
except self.db_pool.engine.module.IntegrityError as e: | ||
# Assume that any `IntegrityError`s are due to partial state events. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wish we could get some way of narrowing this down so we don't have to assume it, but I can't see a way short of matching the error string, which sounds very dodgy.
synapse/handlers/message.py
Outdated
) | ||
).addErrback(unwrapFirstError) | ||
raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(another instance of the 503 that I'm not a fan of; see other thread)
Which question? |
Apparently I wrote some comments a week ago but forgot to publish them... |
except SynapseError as e: | ||
if e.code == HTTPStatus.CONFLICT: | ||
raise PartialStateConflictError() | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way we transport the PartialStateConflictError
across replication is pretty ugly. I'm open to alternative suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels ugly but it also is straight forward so it has that going for it. I think it's fine and we can always change it later
|
||
def __init__(self) -> None: | ||
super().__init__( | ||
HTTPStatus.CONFLICT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 409 Conflict
status code here is from the perspective of replication: the replication /send_event
or /fed_send_events
request includes the event context with the partial state flag, which is in conflict with the current state of the homeserver (and it makes no sense to retry the request).
"Room %s was un-partial stated during `on_send_membership_event`, trying again.", | ||
room_id, | ||
) | ||
return await self._federation_event_handler.on_send_membership_event( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right, a room can only be un-partial-stated once.
Unless we leave it or purge it, but I don't know what happens in that case, even in the absence of faster room joins.
The latter can make reverse proxies do unwanted things.
# TODO(faster_joins): `_should_perform_remote_join` suggests that we may | ||
# do a remote join for restricted rooms even if we have full state. | ||
logger.error( | ||
"Room %s was un-partial stated while processing remote join.", | ||
room_id, | ||
) | ||
raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode) | ||
raise LimitExceededError(msg=e.msg, errcode=e.errcode, retry_after_ms=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we consider opening an issue as to talk about whether we want to make this better?
I see what you're saying but it still feels to me that this is worth thinking about (but I don't want to block this PR on that).
(For sending messages, some clients seem to prompt you to retry sending the message if it fails, I'm not sure about the exact circumstances but leaving it like this means we'll want to check that, so perhaps defer it to an issue regardless)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's fair. I've filed it as #13173.
except SynapseError as e: | ||
if e.code == HTTPStatus.CONFLICT: | ||
raise PartialStateConflictError() | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it feels ugly but it also is straight forward so it has that going for it. I think it's fine and we can always change it later
CI's failing, pending the merge of #402. |
…joins_fix_departial_stating_race
|
Handle race between persisting an event and un-partial stating a room
Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.
We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special
PartialStateConflictError
exception,which makes its way up to the method in which we computed the event
context.
To make things difficult, the exception needs to cross a replication
request:
/fed_send_events
for events coming over federation and/send_event
for events from clients. We transport thePartialStateConflictError
as a409 Conflict
over replication andturn
409
s back intoPartialStateConflictError
s on the worker makingthe request.
All client events go through
EventCreationHandler.handle_new_client_event
, which is called ina lot of places. Instead of trying to update all the code which
creates client events, we turn the
PartialStateConflictError
into a429 Too Many Requests
inEventCreationHandler.handle_new_client_event
and hope that clientstake it as a hint to retry their request.
On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
FederationEventHandler._auth_and_persist_outliers_inner
,FederationHandler.do_knock
,FederationHandler.on_invite_request
andFederationHandler.do_remotely_reject_invite
. These events won't havethe partial state flag, so we do not need to do anything for then.
The remaining 3 paths which create events are
FederationEventHandler.process_remote_join
,FederationEventHandler.on_send_membership_event
andFederationEventHandler._process_received_pdu
.We can't experience the race in
process_remote_join
, unless we'rehandling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.
on_send_membership_event
is only called byFederationServer._on_send_membership_event
, so we catch thePartialStateConflictError
there and retry just once._process_received_pdu
is called byon_receive_pdu
for incomingevents and
_process_pulled_event
for backfill. The latter should nevertry to persist partial state events, so we ignore it. We catch the
PartialStateConflictError
inon_receive_pdu
and retry just once.Refering to the graph of code paths in
#12988 (comment)
may make the above make more sense.
Can be reviewed commit by commit, though it's still easy to get lost.
Refer to the handy picture above to figure out where things fit in.