This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Handle race between persisting an event and un-partial stating a room #13100

Merged on Jul 5, 2022 (18 commits)

@squahtx squahtx commented Jun 17, 2022

Handle race between persisting an event and un-partial stating a room

Whenever we want to persist an event, we first compute an event context,
which includes the state at the event and a flag indicating whether the
state is partial. After a lot of processing, we finally try to store the
event in the database, which can fail for partial state events when the
containing room has been un-partial stated in the meantime.

We detect the race as a foreign key constraint failure in the data store
layer and turn it into a special PartialStateConflictError exception,
which makes its way up to the method in which we computed the event
context.
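
A minimal sketch of that translation, using the standard library's `sqlite3` rather than Synapse's database layer. The table names, the `make_db` and `store_partial_state_event` helpers and the local `PartialStateConflictError` class are illustrative assumptions, not the code in this PR:

```python
import sqlite3


class PartialStateConflictError(Exception):
    """Hypothetical stand-in for the exception this PR introduces."""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite does not enforce foreign keys unless asked to.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute("CREATE TABLE partial_state_rooms (room_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE partial_state_events ("
        " event_id TEXT PRIMARY KEY,"
        " room_id TEXT NOT NULL REFERENCES partial_state_rooms (room_id))"
    )
    return conn


def store_partial_state_event(
    conn: sqlite3.Connection, event_id: str, room_id: str
) -> None:
    """Record `event_id` as a partial state event in `room_id`."""
    try:
        conn.execute(
            "INSERT INTO partial_state_events (event_id, room_id) VALUES (?, ?)",
            (event_id, room_id),
        )
    except sqlite3.IntegrityError as e:
        # As in the PR, assume any IntegrityError here means the room has
        # been un-partial stated since the event context was computed.
        raise PartialStateConflictError() from e
```

In this sketch, deleting the room's row from `partial_state_rooms` plays the role of un-partial stating: a later insert fails its foreign key check and surfaces as `PartialStateConflictError`.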

To make things difficult, the exception needs to cross a replication
request: /fed_send_events for events coming over federation and
/send_event for events from clients. We transport the
PartialStateConflictError as a 409 Conflict over replication and
turn 409s back into PartialStateConflictErrors on the worker making
the request.
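
The round trip can be pictured roughly as follows. This is a simplified sketch with no real HTTP involved; `HttpError`, `replication_endpoint` and `replication_client` are hypothetical names, and only the 409 mapping itself comes from the description above:

```python
from http import HTTPStatus
from typing import Callable


class PartialStateConflictError(Exception):
    """Hypothetical stand-in for the exception this PR introduces."""


class HttpError(Exception):
    """Hypothetical error carrying an HTTP status code back to the caller."""

    def __init__(self, code: int) -> None:
        super().__init__(f"HTTP {code}")
        self.code = code


def replication_endpoint(persist: Callable[[], None]) -> int:
    """Receiving side: persist the event and report a conflict as a 409."""
    try:
        persist()
    except PartialStateConflictError:
        return HTTPStatus.CONFLICT
    return HTTPStatus.OK


def replication_client(persist: Callable[[], None]) -> None:
    """Requesting worker: turn a 409 back into the original exception."""
    try:
        code = replication_endpoint(persist)
        if code != HTTPStatus.OK:
            raise HttpError(code)
    except HttpError as e:
        if e.code == HTTPStatus.CONFLICT:
            raise PartialStateConflictError() from e
        raise
```

The cost of this design is that any 409 from the endpoint is read as a partial state conflict, so the status code has to stay reserved for that purpose on these requests.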

All client events go through
EventCreationHandler.handle_new_client_event, which is called in
a lot of places. Instead of trying to update all the code which
creates client events, we turn the PartialStateConflictError into a
429 Too Many Requests in
EventCreationHandler.handle_new_client_event and hope that clients
take it as a hint to retry their request.
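
A rough sketch of that single conversion point. `LimitExceeded` and the `persist_event` callback are hypothetical stand-ins, and the zero retry delay reflects the `retry_after_ms` of 0 mentioned in the review discussion:

```python
from http import HTTPStatus
from typing import Callable


class PartialStateConflictError(Exception):
    """Hypothetical stand-in for the exception this PR introduces."""


class LimitExceeded(Exception):
    """Hypothetical 429 error; `retry_after_ms` hints when to retry."""

    code = HTTPStatus.TOO_MANY_REQUESTS

    def __init__(self, retry_after_ms: int) -> None:
        super().__init__("Too Many Requests")
        self.retry_after_ms = retry_after_ms


def handle_new_client_event(persist_event: Callable[[], None]) -> None:
    """Single choke point for client events: convert the race into a 429."""
    try:
        persist_event()
    except PartialStateConflictError as e:
        # The room was un-partial stated mid-request. A fresh request will
        # recompute the event context, so invite an immediate retry.
        raise LimitExceeded(retry_after_ms=0) from e
```

Converting in one place keeps the many callers unchanged, at the price of pushing the retry onto clients.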

On the federation event side, there are 7 places which compute event
contexts. 4 of them use outlier event contexts:
FederationEventHandler._auth_and_persist_outliers_inner,
FederationHandler.do_knock, FederationHandler.on_invite_request and
FederationHandler.do_remotely_reject_invite. These events won't have
the partial state flag, so we do not need to do anything for them.

The remaining 3 paths which create events are
FederationEventHandler.process_remote_join,
FederationEventHandler.on_send_membership_event and
FederationEventHandler._process_received_pdu.

We can't experience the race in process_remote_join, unless we're
handling an additional join into a partial state room, which currently
blocks, so we make no attempt to handle it correctly.

on_send_membership_event is only called by
FederationServer._on_send_membership_event, so we catch the
PartialStateConflictError there and retry just once.

_process_received_pdu is called by on_receive_pdu for incoming
events and _process_pulled_event for backfill. The latter should never
try to persist partial state events, so we ignore it. We catch the
PartialStateConflictError in on_receive_pdu and retry just once.
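
The retry-once pattern used on both federation paths might look like this in isolation. `retry_once_on_conflict` is a hypothetical helper written for illustration; the PR itself catches the exception inline in each caller:

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class PartialStateConflictError(Exception):
    """Hypothetical stand-in for the exception this PR introduces."""


async def retry_once_on_conflict(func: Callable[[], Awaitable[T]]) -> T:
    """Run `func`, retrying exactly once if the room was un-partial stated."""
    try:
        return await func()
    except PartialStateConflictError:
        # The second attempt recomputes the event context against the now
        # fully stated room. A conflict here is not retried and propagates.
        return await func()
```

A single retry is the whole budget because the second attempt starts from scratch after the room has already been un-partial stated.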

Referring to the graph of code paths in
#12988 (comment)
may help the above make more sense.


[Figure: graphviz diagram of the code paths]

Can be reviewed commit by commit, though it's still easy to get lost.
Refer to the handy picture above to figure out where things fit in.

Sean Quah added 12 commits June 17, 2022 13:56
Catch `IntegrityError`s instead of `DatabaseError`s and downgrade the
log message to INFO.
Define a `PartialStateConflictError` exception, to be raised when
persisting a partial state event into an un-partial stated room.

Signed-off-by: Sean Quah <seanq@matrix.org>
…vent in an un-partial stated room

Raise a `PartialStateConflictError` in
`PersistEventsStore.store_event_state_mappings_txn` when we try to
persist a partial state event in an un-partial stated room.

Also document the exception in the docstrings for
`PersistEventsStore._persist_events_and_state_updates`,
`_persist_events_txn` and `_update_outliers_txn`.

Signed-off-by: Sean Quah <seanq@matrix.org>
…roller`

Update the docstrings for `persist_event`, `persist_events` and
`_persist_event_batch`.

Signed-off-by: Sean Quah <seanq@matrix.org>
…_and_notify_client_event`

The `PartialStateConflictError` comes from the call to
`EventsPersistenceStorageController.persist_event` in the middle of the
method.

Signed-off-by: Sean Quah <seanq@matrix.org>
Signed-off-by: Sean Quah <seanq@matrix.org>
Instead of sprinkling retries for client events all over the place,
raise a 503 Service Unavailable in
`EventCreationHandler.handle_new_client_event`, which all client events
go through. A 503 is usually temporary and it is hoped that clients will
retry whatever they are doing.

Signed-off-by: Sean Quah <seanq@matrix.org>
…tion

Signed-off-by: Sean Quah <seanq@matrix.org>
…ess_remote_join`

Convert `PartialStateConflictError`s to 503s when processing remote
joins. We make no attempt to handle the error correctly, since it can
only occur on additional joins into partial state rooms, which isn't
supported yet.

Signed-off-by: Sean Quah <seanq@matrix.org>
…push_actions_and_persist_event`

The `PartialStateConflictError` comes from the call to
`persist_events_and_notify` near the end.

Signed-off-by: Sean Quah <seanq@matrix.org>
Signed-off-by: Sean Quah <seanq@matrix.org>
Retry `_process_received_pdu` on `PartialStateConflictError` in
`FederationEventHandler.on_receive_pdu`.

Document `PartialStateConflictError` in the docstring for
`FederationEventHandler._process_received_pdu`. The exception can come
from the call to
`FederationEventHandler._run_push_actions_and_persist_event`.

We ignore `FederationEventHandler._process_pulled_event`, because those
events should not be persisted with partial state.

Signed-off-by: Sean Quah <seanq@matrix.org>
@squahtx squahtx requested a review from a team as a code owner June 17, 2022 13:23
@squahtx squahtx force-pushed the squah/faster_room_joins_fix_departial_stating_race branch from ed23f1d to 8af0caa Compare June 17, 2022 13:26
@reivilibre reivilibre self-assigned this Jun 29, 2022
@reivilibre reivilibre left a comment

This basically looks good to me.

That said:

  • I'm not sure what your question is on about — sorry :/
  • I'm not a fan of giving 503s to clients because they hit a race (but I could be convinced that this can be fixed with a later PR; in that case, maybe it's just time to add a TODO and open an issue?)

Comment on lines 80 to 85
    This error should not be exposed to clients.
    """

    def __init__(self) -> None:
        super().__init__(
            HTTPStatus.CONFLICT,
Contributor

It confused me that this is a SynapseError (one with an HTTP status code) when we shouldn't expose it to clients.

I think you're doing this for replication reasons; maybe it would be worth noting that in the docstring to explain why it has a special HTTP response code.

Contributor Author

I'll update the docstring.

    "Room %s was un-partial stated while processing remote join.",
    room_id,
)
raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode)
Contributor

IIRC this is a bad response code to return because it makes e.g. CloudFlare start to treat us as being down.
I also am not the biggest fan of requiring the client to retry it personally. I think I'd like to see it automatically retried, but appreciate that will be a pain. :/

Maybe the right thing to do is accept this for now, but TODO it/open issue to get around to retrying this in a separate PR for readability?

Contributor Author

Is there a more appropriate status code that we can use?

Contributor Author

I'm still undecided about what to do about the client paths. There's just so many of them :/

Contributor Author

I've gone for 429 Too Many Requests with a retry_after_ms of 0, which has the closest semantics to what we want and ought to avoid any reverse proxy weirdness.

I think it's okay to expect clients to retry requests. They have to have retry logic anyway, for robustness in the face of bad connectivity. I wouldn't expect this extra case to add more complexity to clients.
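
For illustration, client-side retry logic of the kind described here could be as small as the sketch below. `send_with_retry` and its `send` callback (which returns a status code and a parsed JSON body) are hypothetical and not taken from any particular client:

```python
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, int]]


def send_with_retry(send: Callable[[], Response], max_attempts: int = 3) -> Response:
    """Call `send`, retrying on 429 after the server's suggested delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        # Give up retrying on success, on other errors, or on the last attempt.
        if status != 429 or attempt == max_attempts - 1:
            break
        # A `retry_after_ms` of 0 means the request can be retried right away.
        time.sleep(body.get("retry_after_ms", 0) / 1000)
    return status, body
```

With a `retry_after_ms` of 0 the loop retries without sleeping, which matches the intent of signalling a transient race rather than real rate limiting.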

    "Room %s was un-partial stated during `on_send_membership_event`, trying again.",
    room_id,
)
return await self._federation_event_handler.on_send_membership_event(
Contributor

so to clarify, there should only ever be one retry needed because once it's un-partial-stated, it can't conflict anymore?

Contributor Author

That's right, a room can only be un-partial-stated once.
Unless we leave it or purge it, but I don't know what happens in that case, even in the absence of faster room joins.

Comment on lines +1128 to +1129
except self.db_pool.engine.module.IntegrityError as e:
    # Assume that any `IntegrityError`s are due to partial state events.
Contributor

I wish we could get some way of narrowing this down so we don't have to assume it, but I can't see a way short of matching the error string, which sounds very dodgy.

)
).addErrback(unwrapFirstError)
raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode)
Contributor

(another instance of the 503 that I'm not a fan of; see other thread)

@reivilibre reivilibre removed their assignment Jun 29, 2022
squahtx commented Jun 29, 2022

I'm not sure what your question is on about — sorry :/

Which question?

squahtx commented Jun 29, 2022

Apparently I wrote some comments a week ago but forgot to publish them...

Comment on lines +1376 to +1379
except SynapseError as e:
    if e.code == HTTPStatus.CONFLICT:
        raise PartialStateConflictError()
    raise
Contributor Author

The way we transport the PartialStateConflictError across replication is pretty ugly. I'm open to alternative suggestions.

Contributor

It feels ugly, but it is also straightforward, so it has that going for it. I think it's fine, and we can always change it later.


    def __init__(self) -> None:
        super().__init__(
            HTTPStatus.CONFLICT,
Contributor Author

The 409 Conflict status code here is from the perspective of replication: the replication /send_event or /fed_send_events request includes the event context with the partial state flag, which is in conflict with the current state of the homeserver (and it makes no sense to retry the request).

@squahtx squahtx requested a review from reivilibre July 1, 2022 20:13
  # TODO(faster_joins): `_should_perform_remote_join` suggests that we may
  # do a remote join for restricted rooms even if we have full state.
  logger.error(
      "Room %s was un-partial stated while processing remote join.",
      room_id,
  )
- raise SynapseError(HTTPStatus.SERVICE_UNAVAILABLE, e.msg, e.errcode)
+ raise LimitExceededError(msg=e.msg, errcode=e.errcode, retry_after_ms=0)
Contributor

Should we consider opening an issue to talk about whether we want to make this better?
I see what you're saying, but it still feels to me that this is worth thinking about (I don't want to block this PR on that, though).
(For sending messages, some clients seem to prompt you to retry sending the message if it fails. I'm not sure about the exact circumstances, but leaving it like this means we'll want to check that, so perhaps defer it to an issue regardless.)

Contributor Author

that's fair. I've filed it as #13173.

squahtx commented Jul 4, 2022

CI's failing, pending the merge of #402.

squahtx commented Jul 5, 2022

TestRestrictedRoomsLocalJoin and TestSendJoinPartialStateResponse are known worker mode flakes: #13161
