processing send_join races with local users sending messages #17720

Open
nico-famedly opened this issue Sep 17, 2024 · 0 comments

Description

When a user on a remote server is invited to a room and joins it, the remote server may miss messages that are sent while the local server is processing the send_join. These events are only backfilled later, once the local server sends a new message. This can be quite confusing: you have usually sent some messages and are waiting for a response, but the other side does not have all the messages yet, so it might not respond.

I sadly can't include any logs, since those include customer data, but the relevant information seems to be this:

We have the following events:

  • An invite before all the following events
  • event A, message, prev events vary
  • event B1, message, prev events: A, streamid 9
  • event B2, join event, prev events: A, streamid 10
  • event C, message, prev events: B1, B2, streamid 11

B1 is the delayed event. It isn't sent out, since at that time the remote server is only invited, not joined. However, since the remote server is already processing the join, it only fetches the prev events for B2 and is unaware that B1 exists. Later federation transactions from local to remote don't pick up B1, I think because its streamid is smaller. This stays the case until a new event is created (C in our case). Sending C still doesn't send out B1, even though B1 is one of its prev_events, but it does trigger backfill from the remote server (since C references both B1 and B2).
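To make the ordering problem easier to follow, here is a minimal toy model of the scenario. It is not Synapse code: the `Event` class and `simulate` function are hypothetical names, and the rule "only push an event if the remote is already joined when it is processed" is a simplified assumption based on the behaviour described above. Event IDs, prev events and stream IDs are taken from the list above, and the remote is assumed to already have A.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str            # "message" or "join"
    prev_events: tuple   # IDs of the events this one points at
    stream_id: int


# Events as listed in the report.
B1 = Event("B1", "message", ("A",), 9)
B2 = Event("B2", "join", ("A",), 10)
C = Event("C", "message", ("B1", "B2"), 11)


def simulate(events):
    """Toy model: an event is pushed only if the remote is joined at that point."""
    remote_joined = False
    remote_has = {"A"}  # assumption: the remote fetches A after send_join
    log = []
    for ev in sorted(events, key=lambda e: e.stream_id):
        if ev.kind == "join":
            remote_joined = True
            remote_has.add(ev.event_id)  # the remote created the join itself
            log.append((ev.event_id, "join processed"))
            continue
        if not remote_joined:
            # Never retried later in this model, mirroring the reported behaviour.
            log.append((ev.event_id, "not sent (remote only invited)"))
            continue
        missing = [p for p in ev.prev_events if p not in remote_has]
        remote_has.add(ev.event_id)
        if missing:
            remote_has.update(missing)
            log.append((ev.event_id, f"sent, remote backfills {missing}"))
        else:
            log.append((ev.event_id, "sent"))
    return log


for entry in simulate([B1, B2, C]):
    print(entry)
```

In this model B1 is skipped because of the membership state at stream position 9, and only C (which names B1 as a prev event) causes the remote to notice the gap.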

The sequence of requests around the send_join is:

  • Received PUT to /_matrix/client/v3/rooms/!room/send/m.room.encrypted/txnid-1 (event B1) [worker d1]
  • Signing event B1 [worker d1]
  • Event auth allowing B1 [worker d1]
  • Start persist_events for B1 [worker d1]
  • Received make_join from remote server and make_join returns 200 to remote server [worker 0b]
  • persist_events TXN START and END, outliers get updated [worker d1]
  • (federation transmission loop finishes sending some presence events to remote [worker b6])
  • Received request to send_join (event B2) [worker d1]
  • persist_events TXN START and END for MultiWriterIdGenerator._update_table [worker d1]
  • Keyring fetch for remote server key [worker d1]
  • Return 200 for PUT to /_matrix/client/v3/rooms/!room/send/m.room.encrypted/txnid-1 [worker d1]
  • Remote key fetch done [worker d1]
  • Verify content hash of B2, on_send_membership_event [worker d1]
  • calculate state groups for B2 [worker d1]
  • soft fail checks for B2 [worker d1]
  • State resolution (1 conflicting entry) [worker d1]
  • Might drop extremities decides not to drop B1 [worker d1]
  • Persist outliers B2 [worker d1]
  • Return 200 to remote for send_join [worker 0b]
  • (federation transmission loop sends some EDUs to remote [worker b6])
  • Remote requests state ids for A, the event A and backfills from A [worker 0b]

The local server also sends a lot of EDUs to the remote server all the time, which probably affects this issue somewhat, since sending them will likely also mark a streamid as processed?

Steps to reproduce

  • have a local room
  • invite a remote user
  • have the remote user join exactly when you are sending an event
  • The remote server won't receive that event until you send another event later
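Since the join has to land exactly while an event is being sent, a reproduction attempt probably needs to fire both requests at the same moment. The following is only a sketch of such a harness: `race`, `send_message` and `join_room` are hypothetical names, and the two callables are placeholders for whatever client calls you use to send the message locally and to join from the remote side.

```python
import threading


def race(send_message, join_room):
    """Start both callables at (nearly) the same moment and collect their results."""
    barrier = threading.Barrier(2)
    results = {}

    def run(name, fn):
        barrier.wait()        # block until both threads are ready, then release together
        results[name] = fn()

    threads = [
        threading.Thread(target=run, args=("send", send_message)),
        threading.Thread(target=run, args=("join", join_room)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Given that the issue only shows up at low frequency, a single run is unlikely to hit the window, so this would likely have to be repeated with fresh invites.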

Homeserver

multiple

Synapse Version

1.85.2 (local), 1.107 (remote). Sadly, the server where I managed to track this down at the right time was an old one, but the issue happens at low frequency on a lot of our servers.

Installation Method

Other (please mention below)

Database

postgres, no separate servers

Workers

Multiple workers

Platform

Docker, custom image with a few extra modules

Configuration

A few modules to validate invites, but those weren't executed in the relevant part of the issue.

Relevant log output

I can't provide those at the moment since they include customer data I am not allowed to share, but I included the relevant information in the description, which I hope is sufficient.

Anything else that would be useful to know?

No response
