MSC2716: Incrementally importing history into existing rooms #2716

Open: wants to merge 69 commits into base: old_master
Changes from 4 commits
Commits (69):
8c8d5e3
MSC2716: Incrementally importing history into existing rooms
ara4n Aug 4, 2020
3a03172
note that we don't solve lazyloading history from ASes
ara4n Aug 4, 2020
5e6b7b9
add another alternative
ara4n Aug 4, 2020
9451439
s/parent/prev_event/ for consistency with SS API
ara4n Aug 11, 2020
8668e3a
Add initial draft of alternative batch sending historical messages
MadLittleMods Jul 22, 2021
d40f3b9
Update with chunk events
MadLittleMods Jul 22, 2021
5854ca2
Add note about adding m.historical
MadLittleMods Jul 22, 2021
b766be5
Only connect the base insertion event to the specified prev_event
MadLittleMods Jul 29, 2021
b448452
Start of consolidation and adding more clear information
MadLittleMods Aug 7, 2021
8a4d136
Wrap lines
MadLittleMods Aug 7, 2021
92f87ed
Add remaining alternatives
MadLittleMods Aug 7, 2021
bb44a63
Correct stable endpoint location
MadLittleMods Aug 7, 2021
3367f56
Reading pass
MadLittleMods Aug 7, 2021
a4c474e
Fix casing typo
MadLittleMods Sep 7, 2021
9df8b6e
Split out meta MSC2716 events into their own fields
MadLittleMods Sep 17, 2021
b3b7903
Remove outdated comment on current iteration of spec
MadLittleMods Sep 17, 2021
38bebb1
Use more obvious query param name
MadLittleMods Sep 17, 2021
7df80fe
Rename from chunks to batches
MadLittleMods Oct 13, 2021
3f28588
Add graph to show how historical state plays into the DAG
MadLittleMods Oct 14, 2021
80e68bc
Add server detection support
MadLittleMods Dec 15, 2021
1678282
Prefer empty prev_events=[] over fake prev_events
MadLittleMods Feb 7, 2022
c6a60b1
GitHub now supports mermaid natively
MadLittleMods Feb 16, 2022
65e5f7b
Incorporate in feedback
MadLittleMods Apr 14, 2022
d7cf789
Fix little mistakes
MadLittleMods May 12, 2022
d016b7d
Emphasize has *all* history as it's the key differentiator for that s…
MadLittleMods Jun 3, 2022
2544a3f
Address markers being lost in timeline gaps (marker events as state)
MadLittleMods Jun 3, 2022
7258f64
Formatting
MadLittleMods Jun 3, 2022
a828de3
Small typos and other fixes
anoadragon453 Aug 9, 2022
b2b5b54
Use json5
MadLittleMods Aug 9, 2022
a7920fb
Remove base graph
MadLittleMods Aug 10, 2022
73f4143
?ts is now specced
MadLittleMods Aug 10, 2022
92a7658
Say explicit for current room version
MadLittleMods Aug 10, 2022
1cf7395
Say which query parameters are optional vs required
MadLittleMods Aug 10, 2022
2433dfa
"Live timeline" so it's obvious
MadLittleMods Aug 10, 2022
efbee43
Fix stable versions
MadLittleMods Aug 10, 2022
b4ba8c4
Feature needs to be true
MadLittleMods Aug 10, 2022
7cde5cd
More clear phrasing
MadLittleMods Aug 10, 2022
8a50cbf
Add annotation that older events are at the top of the graphs
MadLittleMods Aug 10, 2022
8286ca4
A --> B is just where you want to import between
MadLittleMods Aug 10, 2022
850e2f1
More than one reason for new room version
MadLittleMods Aug 10, 2022
3332ca8
unioned with the state at the prev_event_id
MadLittleMods Aug 10, 2022
3668193
State events are allowed
MadLittleMods Aug 10, 2022
c936c7b
Merge branch 'matthew/msc2716' of github.com:matrix-org/matrix-doc in…
MadLittleMods Aug 10, 2022
1d3f562
Unsaved merge conflict
MadLittleMods Aug 10, 2022
e593c20
More clear image
MadLittleMods Aug 10, 2022
2c46547
Just use event types when referring to them
MadLittleMods Aug 10, 2022
f60c233
Self-referential batches descoped to another MSC
MadLittleMods Aug 10, 2022
b081ec7
Re-arrange to explain events and fields in table
MadLittleMods Aug 12, 2022
d20455f
Fix some table formatting and better full examples
MadLittleMods Aug 12, 2022
a8313bd
Link to depth discussion
MadLittleMods Aug 12, 2022
9d96c5c
Wrapping
MadLittleMods Aug 12, 2022
991bd84
Fix direction
MadLittleMods Aug 17, 2022
55551fc
Remove namespace beacuse the event type is already the namespace
MadLittleMods Aug 17, 2022
1ee23d4
Fix heading structure and more words to describe the historical property
MadLittleMods Aug 24, 2022
e7e435d
Explain when the messages in the example were sent
MadLittleMods Aug 24, 2022
775d3d3
Clarify that you provide it next time
MadLittleMods Aug 24, 2022
6ccecd7
Clarify how it connects
MadLittleMods Aug 24, 2022
af24a5f
Fix endpoint path (no unstable)
MadLittleMods Aug 24, 2022
10599cb
Add heading for new event types
MadLittleMods Aug 24, 2022
a6b5d8f
Use m.room.insertion type name in mermaid graphs
MadLittleMods Aug 24, 2022
0642d88
Remove backticks from mermaid graphs
MadLittleMods Aug 24, 2022
b18c214
Add more initial explanation
MadLittleMods Aug 24, 2022
02b5f4b
Better DAG to match expectation image
MadLittleMods Aug 24, 2022
69bd287
Add example why you would use the historical content property
MadLittleMods Aug 25, 2022
5412e80
Remove "full"
MadLittleMods Aug 25, 2022
16a6a40
Fix historical typo
MadLittleMods Aug 25, 2022
4a8f834
Explain that /batch_send does the insertion/batch dance for you
MadLittleMods Aug 25, 2022
e4193ff
Make it more clear what the drawbacks are
MadLittleMods Apr 13, 2023
1fc8b6b
with should be without
MadLittleMods Apr 13, 2023
135 changes: 135 additions & 0 deletions proposals/2716-importing-history-into-existing-rooms.md
@@ -0,0 +1,135 @@
# MSC2716: Incrementally importing history into existing rooms

## Problem

Matrix has historically been unable to easily import existing history into a
room that already exists. This is a major problem when bridging existing
conversations into Matrix, particularly if the scrollback is being
incrementally or lazily imported.

For instance, an NNTP bridge might work by letting a user join a room that
maps to a given newsgroup, first showing an empty room, and then importing the
most recent 1000 newsgroup posts for that room to flesh out some history. The
bridge might then choose to slowly import additional posts for that newsgroup
in the background, until however many decades of backfill were complete.
Finally, as more archives surface, they might also need to be manually
gradually added into the history of the room - slowly building up the complete
history of the conversations over time.

This is currently not supported because:
* There is no way to set historical room state in a room via the CS or AS API -
you can only edit current room state.
* There is no way to create messages in the context of historical room state in
a room via CS or AS API - you can only create events relative to current room
state.
* There is currently no way to override the timestamp on an event via the AS API.
(We used to have the concept of [timestamp
massaging](https://matrix.org/docs/spec/application_service/r0.1.2#timestamp-massaging),
but it was never properly specified.)

## Proposal

1. We let the AS API override the prev_event(s) of an event when injecting it into
the room, thus letting bridges consciously specify the topological ordering of
the room DAG. We do this by adding a `prev_event` querystring parameter on the
`PUT /_matrix/client/r0/rooms/{roomId}/send/{eventType}/{txnId}` and
`PUT /_matrix/client/r0/rooms/{roomId}/state/{eventType}/{stateKey}` endpoints.
The `prev_event` parameter can be repeated multiple times to specify multiple parent
event IDs of the event being submitted. An event must not have more than 20 prev_events.
If no `prev_event` parameter is provided, the server assumes the event is being
appended to the current timeline and calculates the prev_events as normal. If an
unrecognised event ID is specified as a `prev_event`, the request fails with a 404.

2. We also let the AS API override ('massage') the `origin_server_ts` timestamp applied
to sent events. We do this by adding a `ts` querystring parameter on the
`PUT /_matrix/client/r0/rooms/{roomId}/send/{eventType}/{txnId}` and
`PUT /_matrix/client/r0/rooms/{roomId}/state/{eventType}/{stateKey}` endpoints, specifying
the value to apply to `origin_server_ts` on the event (UNIX epoch milliseconds).

3. Finally, we can add an optional `"m.historical": true` field to events to
indicate that they are historical at the point of being added to a room, and
as such servers should not serve them to clients via the CS `/sync` API -
instead preferring clients to discover them by paginating scrollback via
`/messages`.
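
As a rough illustration of points 1 and 2, the sketch below builds the request URL for the `send` endpoint with repeated `prev_event` parameters and an optional `ts` override. The helper name `build_send_url`, the base URL, and the event IDs are hypothetical and only serve the example; the endpoint path, the parameter names, and the limit of 20 prev_events come from the text above.

```python
from urllib.parse import quote, urlencode

MAX_PREV_EVENTS = 20  # limit stated in the proposal


def build_send_url(base_url, room_id, event_type, txn_id, prev_events=(), ts=None):
    """Build the PUT URL for sending an event with optional history overrides.

    `build_send_url` is a hypothetical helper, not part of any Matrix SDK.
    """
    if len(prev_events) > MAX_PREV_EVENTS:
        raise ValueError("an event must not have more than 20 prev_events")

    path = "/_matrix/client/r0/rooms/{}/send/{}/{}".format(
        quote(room_id, safe=""),
        quote(event_type, safe=""),
        quote(txn_id, safe=""),
    )

    # Repeat the prev_event parameter once per parent event ID.
    params = [("prev_event", event_id) for event_id in prev_events]

    # ts is the origin_server_ts override in UNIX epoch milliseconds.
    if ts is not None:
        params.append(("ts", str(ts)))

    query = urlencode(params)
    return base_url + path + ("?" + query if query else "")


url = build_send_url(
    "https://hs.example",            # hypothetical homeserver
    "!room:example.org",             # hypothetical room ID
    "m.room.message",
    "txn1",
    prev_events=["$a", "$b"],        # hypothetical parent event IDs
    ts=1596499200000,
)
print(url)
```

When neither parameter is passed, the URL has no query string, which corresponds to the default behaviour of appending to the current timeline.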

This lets history be injected at the right place topologically in the room. For instance, different eras of the room could
end up as branches off the original `m.room.create` event, each first setting up the contextual room state for that era before
the block of imported history. So, you could end up with something like this:

```
  m.room.create
     |\
     | \___________________________________
     |  \                                   \
     |   \                                   \
live timeline   previous 1000 messages   another block of ancient history
```

Review comments on the diagram above:

kegsay (Member), Dec 17, 2020:

Backfilling via /messages works by walking back up prev_events. If the DAG looks like this, we'll never hit different eras so /messages will return 0 events.

EDIT: Actually it uses depth which will interleave instead. /get_missing_events will however walk up prev events, so all these lovely eras will never make it to other federated servers.

Member:

I don't think this works because the /messages endpoint has no idea when to jump to a different era. That endpoint topologically walks the DAG (in Dendrite it does this by depth), meaning if you actually did this you would get interleaved events as each era's events start producing the same depth values. This at least returns the events in the forks, but not where you want them.

Member:

In addition, doing this would produce forwards extremities at the end of each era, which servers will attempt to merge.

Member Author:

so my expectation is that a homeserver should calculate an appropriate depth when importing history like this, probably by tiebreaking based on origin_server_ts. Where does Dendrite get its depth param from? As it certainly shouldn't be trusting the one it receives over federation, because of https://github.com/matrix-org/matrix-doc/issues/1229.

Member Author:

also, the fact that we create forward extremities at the end of each era which then get merged by the next message sent in the room was intended to be a feature, not a bug.

MadLittleMods (Contributor), Jan 5, 2021:

From our meeting, the depth is assumed from the stream ID and can be spoofed. I may not have the details correct but we did discuss fudging it.

Member Author:

So having rediscussed this IRL: Dendrite (and Synapse) currently get their depth parameters used for ordering from the wire. Ideally, we'd calculate the depth parameter instead - which could be easy, if we mandate that blocks of old history are always loaded contiguously in reverse chronological order. As a quick fudge to test the approach however we could set depth=1 for these events, and hopefully the default ordering will be sufficient (we think it is on synapse, but dendrite might need a tweak).
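
The interleaving concern raised in review can be reproduced with a toy model. The sketch below assumes depth is computed as one more than the largest depth among an event's prev_events, and that pagination orders events purely by depth; the `Event` class and the event IDs are hypothetical, and real homeservers differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    prev_events: list = field(default_factory=list)


def compute_depths(events):
    """Assign depth = 1 + max(depth of prev_events).

    Assumes `events` is listed in topological order (parents first).
    """
    depths = {}
    for ev in events:
        depths[ev.event_id] = 1 + max((depths[p] for p in ev.prev_events), default=0)
    return depths


# Two branches hanging off m.room.create: the live timeline and one imported era.
events = [
    Event("create"),
    Event("live1", ["create"]),
    Event("live2", ["live1"]),
    Event("old1", ["create"]),
    Event("old2", ["old1"]),
]

depths = compute_depths(events)

# Ordering by depth alone (stable sort) mixes the two branches together.
order = [e.event_id for e in sorted(events, key=lambda e: depths[e.event_id])]
print(order)  # ['create', 'live1', 'old1', 'live2', 'old2']
```

Both branches produce the same depth values, so a depth-ordered walk returns the imported era interleaved with the live timeline rather than as a contiguous block.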

We consciously don't support the new `prev_event` and `ts` parameters on the
various helper syntactic-sugar APIs like `/kick` and `/ban`. If a bridge/bot is
smart enough to be faking history, it is already in the business of dealing
with raw events, and should not be using the syntactic sugar APIs.

## Potential issues

There are a bunch of security considerations here - see below.

This doesn't provide a way for a HS to tell an AS that a client has tried to call
/messages beyond the beginning of a room, and that the AS should try to
lazy-insert some more messages (as per https://github.com/matrix-org/matrix-doc/issues/698).
For this MSC to be properly useful, we might want to flesh that out.

## Alternatives

We could insist that we use the SS API to import history in this manner rather than
extending the AS API. However, it seems unnecessarily burdensome to make bridge authors
understand the SS API, especially when we already have so many AS API bridges. Hence these
minor extensions to the existing AS API.

Another way of doing this might be to store the different eras of the room as
different versions of the room, using `m.room.tombstone` events to form a
linked list of the eras. This has the advantage of isolating room state
between different eras of the room, simplifying state resolution calculations
and avoiding risk of any cross-talk. It's also easier to reason about, and
avoids exposing the DAG to bridge developers. However, it would require
better presentation of room versions in clients, and it would require support
for retrospectively specifying the `predecessor` of the current room when you
retrospectively import history. Currently `predecessor` is in the immutable
`m.room.create` event of a room, so cannot be changed retrospectively - and
doing so in a safe and race-free manner sounds Hard.
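
To make the linked-list idea concrete, here is a minimal sketch of walking eras from the current room back to the oldest one by following `predecessor` in each room's `m.room.create` content. The in-memory `rooms` mapping, the room IDs, and the `walk_eras` helper are hypothetical; a real client would fetch this state from the homeserver.

```python
def walk_eras(rooms, current_room_id):
    """Yield room IDs from the newest era to the oldest.

    `rooms` maps room ID -> {state event type -> event content}; this layout
    is an assumption made for the example.
    """
    seen = set()
    room_id = current_room_id
    # Stop on unknown rooms and guard against accidental cycles.
    while room_id in rooms and room_id not in seen:
        seen.add(room_id)
        yield room_id
        predecessor = rooms[room_id]["m.room.create"].get("predecessor")
        room_id = predecessor["room_id"] if predecessor else None


rooms = {
    "!era3:example.org": {"m.room.create": {"predecessor": {"room_id": "!era2:example.org"}}},
    "!era2:example.org": {"m.room.create": {"predecessor": {"room_id": "!era1:example.org"}}},
    "!era1:example.org": {"m.room.create": {}},
}

eras = list(walk_eras(rooms, "!era3:example.org"))
print(eras)
```

Because `predecessor` lives in the immutable `m.room.create` event, this chain can only be extended forwards, which is the limitation described above for retrospective imports.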

Another way could be to let the server that issued the `m.room.create` also go
and retrospectively insert events into the room outside the context of the DAG
(i.e. without parent prev_events or signatures). To quote the original
[bug](https://github.com/matrix-org/matrix-doc/issues/698#issuecomment-259478116):

> You could just create synthetic events which look like normal DAG events but
exist before the m.room.create event. Their signatures and prev-events would
all be missing, but they would be blindly trusted based on the HS who is
allowed to serve them (based on metadata in the m.room.create event). Thus
you'd have a perimeter in the DAG beyond which events are no longer
decentralised or signed, but are blindly trusted to let HSes insert ancient
history provided by ASes.

However, this feels needlessly complicated if the DAG approach is sufficient.

## Security considerations

This allows an AS to tie the room DAG in knots by specifying inappropriate
event IDs as parents, potentially DoSing the state resolution algorithm, or
triggering undesired state resolution results. However, this is already possible
via the SS API today, and given that the AS API requires the homeserver admin to
explicitly authorise the AS in question, this doesn't feel too bad.

This also makes it much easier for an AS to maliciously spoof history. This
is somewhat unavoidable given the nature of the feature, and is also possible
today via the SS API.

If the state changes from under us due to importing history, we have no way to
tell the client about it. This is an [existing
bug](https://github.com/matrix-org/synapse/issues/4508) that can be triggered
today by SS API traffic, so is orthogonal to this proposal.

## Unstable prefix

Feels unnecessary.

Review comment (Member):

It might be useful to have an unstable feature flag to check if the homeserver supports this.