-
Notifications
You must be signed in to change notification settings - Fork 385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MSC2326: Label based filtering #2326
base: old_master
Are you sure you want to change the base?
Changes from all commits
3214e90
8c84d7b
cb0c68f
6627b00
32597a7
6f36f56
46d412e
78c4e16
05217cd
da7776f
158f11a
3a8f716
b080337
61f1396
d1110a2
45225af
f325203
a6d1249
a3450a6
7a21efd
4b7ca52
88c93dc
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,219 @@ | ||
# Label based filtering | ||
|
||
## Problem | ||
|
||
Rooms often contain overlapping conversations, which Matrix should help users | ||
navigate. | ||
|
||
## Context | ||
|
||
We already have the concept of 'Replies' to define which messages are responses | ||
to which, which [MSC1849](https://github.com/matrix-org/matrix-doc/pull/1849) | ||
proposes extending into a generic mechanism for defining threads which could (in | ||
future) be paginated both depth-wise and breadth-wise. Meanwhile, | ||
[MSC1198](https://github.com/matrix-org/matrix-doc/issues/1198) is an alternate | ||
proposal for threading, which separates conversations into high-level "swim | ||
lanes" with a new `POST /rooms/{roomId}/thread` API. | ||
|
||
However, fully generic threading (which could be used to implement forum or | ||
email style semantics) runs a risk of being overly complicated to specify and | ||
implement and could result in feature creep. This is doubly true if you try to | ||
implement retrospective threading (e.g. to allow moderators to split off | ||
messages into their own thread, as you might do in a forum or to help manage | ||
conversation in a busy chatroom). | ||
|
||
Therefore, this is a simpler proposal to allow messages in a room to be filtered | ||
based on a given label in order to give basic one-layer-deep threading | ||
functionality. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Still, the 'reply to an existing message' is an important use case. A message cannot be replied to unless it is already labelled. Regular users cannot add labels to messages they did not author. Should a (unique) label to otherwise unlabelled messages be required? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Isn't that like replying to the message based on message ID? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. #1849 might be better suited for this kind of thing since a client could directly request the message that is referenced... Not sure though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @ptman I was thinking subsequent messages could reuse the existing label. Basically avoid nested threads. AFAICT #1849 proposes a strictly one-way relation (i.e. child points to parent). Wouldn't that make it expensive to list the thread starting from a particular message? Each subsequent message would have to be determined by reverse look-up. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I personally prefer hierarchical threads, but either way, regardless of the how the relationships are recorded, they can be shown flat, just like apple mail and gmail do. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed, but I was under the impression that this proposal specifically aims for flat threads for the sake of simplicity. |
||
|
||
## Proposal | ||
|
||
We let users specify an optional `m.labels` field onto the events. This field | ||
babolivier marked this conversation as resolved.
Show resolved
Hide resolved
|
||
lists freeform text labels: | ||
|
||
```json | ||
{ | ||
// ... | ||
"m.labels": [ "somelabel" ] | ||
} | ||
``` | ||
|
||
The labels are expected to be insensitive to case, therefore clients are | ||
expected to lowercase them before sending them to servers. A label's length is | ||
limited to a maximum of 100 characters. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. enforced how? What error code do I get for sending such an event? what if it's received over federation with >100 characters? |
||
|
||
Labels which are prefixed with # are expected to be user-visible and exposed to | ||
the user by clients as a hashtag, letting the user filter their current room by | ||
the various hashtags present within it. Labels which are not prefixed with # are | ||
expected to be hidden from the user by clients (so that they can be used as | ||
e.g. thread IDs bridged from another platform). | ||
babolivier marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
A label can contain any UTF-8 character, regardless of whether it starts with a | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure we need to support line feeds and non-printable characters. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this also means I can't use emoji, and that makes me sad. Or it would if I could label the conversation as sad. |
||
hash or not (i.e. we don't limit the hashtag to e.g. the set of allowed | ||
characters in a Twitter hashtag, and not-hashtag-like labels also don't have | ||
that sort of constraints). | ||
|
||
Clients can use these to filter the overlapping conversations in a room into | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A concern overall with this proposal is that once someone uses |
||
different topics. The labels could also be used when bridging as a hashtag to | ||
help manage the disconnect which can happen when bridging a threaded room to an | ||
unthreaded one. | ||
|
||
Clients are expected to let users add hashtag-like labels to a message before | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do I prevent people from using labels in my room? It's entirely reasonable that a room (maybe our own?) will want to disable threading of any kind because a subset of their users don't use clients which support it. When we enable threading, the conversation makes less sense because suddenly someone says "I agree" in a linear timeline with 2+ concurrent discussions going on. An alternative to using power levels somehow is to introduce a reasonable fallback that can be easily parsed by clients/bridges. Similar to how replies dump a fallback into the |
||
sending it, and to display hashtag-like labels on messages to help user easily | ||
identify the labels they can make their client filter on. | ||
|
||
Clients are expected to explicitly set the label on a message if the user's | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "the" implies there's only one you have to worry about - shouldn't you be copying the whole set? |
||
intention is to respond as part of a given labelled topic. For instance, if the | ||
user is currently filtered to only view messages with a given label, then new | ||
messages sent should use the same label. Similarly if the user sends a reply to | ||
a given message, that reply should typically use the same labels as the message | ||
being replied to. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. My remaining doubt is how we're expecting clients to expose adding a label initially, eg. if there would be another button in the composer or similar to add a label to a new message or whether you'd just let them be added retrospectively. Likewise, would we expect to show the labels on each message / show on hover etc. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I've added some clarifications on that point in 4b7ca52, but designing UI isn't my area of expertise so I'm happy to discuss it. |
||
|
||
When a user wants to filter a room to given label(s), it defines a filter for | ||
use with `/sync`, `/context`, `/search` or `/messages` to limit appropriately. | ||
This is done by new `labels` and `not_labels` fields to the `EventFilter` | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently, this doesn't seem to make it possible to expose to the user which labels are present in a room (since the client would have to scan all of the events in a room for labels). IMO, it would be poor UX to not have available labels visible. I suppose there could be two ways of doing this:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I made a proposal about an endpoint to get the list of labels available in a given room here: #2326 (comment), but haven't gotten around to add it to the proposal yet. |
||
object, which specifies a list of labels to include or exclude in the given | ||
filter. | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This may be something for another MSC down the road, but would it make sense to have power level restrictions for labels? For example, I may want to have an announcements label in my room that only bots and I can post in. For example, a GitHub status label could be useful to filter out announcement spam from the main room conversation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is an interesting thought, but is probably out of the scope of this MSC imho |
||
Senders may edit the `m.label` fields in order to change the labels associated | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it might be premature to discuss editing events before MSC1849 is ready; but let's at least link to it to clarify what we mean by "edit" |
||
with an event. If an edit removes a label that was previously associated with | ||
the original event or a past edit of it, neither the original event nor an edit | ||
of it should be returned by the server when filtering for events with that | ||
label. | ||
|
||
When editing the list of labels associated with an encrypted event, clients must | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This paragraph should be under "Encrypted rooms" |
||
set the updated list of labels in the `content` field of the encrypted event in | ||
addition with the `m.new_content` field of the decrypted event's `content` | ||
field, so that servers can update the list of labels associated with the | ||
original event accordingly. | ||
|
||
When sending an edit of an event that has labels attached to it, clients are | ||
expected to provide a list of labels, even if the edit doesn't add or remove any | ||
label from the list provided in the original event or its latest edit (in this | ||
case, the list is the same as the one provided in the original event or its | ||
latest edit). | ||
|
||
### Encrypted rooms | ||
|
||
In encrypted rooms, the `m.label` field of `m.room.encrypted` events contains, | ||
for each label of the event that's being encrypted, a SHA256 hash of the | ||
concatenation of the text label and the ID of the room the event is being sent | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. the client isn't required to lowercase the label before sending, it's just expected that they do. To make this a bit safer to use it would be best if the client received an error for not lowercasing their labels. |
||
to, i.e. `hash = SHA256(label_text + room_id)`. | ||
babolivier marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
The reason behind using a hash built from the text label and the ID of the room | ||
here instead of e.g. a random opaque string or a peppered hash is to maintain | ||
consistency of the key without having access to the entire history of the room | ||
or exposing the actual text of the label to the server, so that e.g. a new | ||
client joining the room would be able to use the same key for the same label as | ||
any other client. See the ["Alternative solutions"](#alternative-solutions) for | ||
more information on this point. | ||
|
||
babolivier marked this conversation as resolved.
Show resolved
Hide resolved
|
||
When filtering events based on their label(s), clients are expected to use the | ||
hash of the label(s) to filter in or out instead of the actual label text. | ||
|
||
#### Example | ||
|
||
Consider a label `#fun` on a message sent to a room which ID is | ||
`!someroom:example.com`. Before encryption, the message would be: | ||
|
||
```json | ||
{ | ||
"type": "m.room.message", | ||
"content": { | ||
"body": "who wants to go down the pub?", | ||
"msgtype": "m.text", | ||
"m.labels": [ "#fun" ] | ||
} | ||
} | ||
``` | ||
|
||
`3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a` is the SHA256 | ||
hash of the string `#fun!someroom:example.com`. Here's an example code | ||
(JavaScript) to compute it: | ||
|
||
```javascript | ||
label_key_unhashed = "#fun" + "!someroom:example.com" | ||
hash = crypto.createHash('sha256'); | ||
hash.write(label_key_unhashed); | ||
label_key = hash.digest("hex"); // 3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a | ||
``` | ||
|
||
Once encrypted, the event would become: | ||
|
||
```json | ||
{ | ||
"type": "m.room.encrypted", | ||
"content": { | ||
"algorithm": "m.megolm.v1.aes-sha2", | ||
"ciphertext": "AwgAEpABm6.......", | ||
"device_id": "SOLZHNGTZT", | ||
"sender_key": "FRlkQA1enABuOH4xipzJJ/oD8fxiQHj6jrAyyrvzSTY", | ||
"session_id": "JPWczbhnAivenK3qRwqLLBQu4W13fz1lqQpXDlpZzCg", | ||
"m.labels": [ | ||
"3204de89c747346393ea5645608d79b8127f96c70943ae55730c3f13aa72f20a" | ||
] | ||
} | ||
} | ||
``` | ||
|
||
## Problems | ||
|
||
Do we care about internationalising hashtags? | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Generally the room will be using a specific language, so probably not. |
||
|
||
Too many threading APIs? | ||
|
||
Using hashes means that servers could be inclined to compute rainbow tables to | ||
read labels on encrypted messages. However, since we're using the room ID as | ||
some kind of hash, it makes it much more expensive to do because it would mean | ||
maintaining one rainbow table for each encrypted room it's in, which would | ||
probably make it not worth the trouble. | ||
|
||
## Alternative solutions | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was just rereading this proposal, I think @dkasak's points on the main thread are legitimate: that hashing the labels give a very false sense of security here. Given how strong our e2ee is, folks will assume opaque labels are actually encrypted, rather than just obfuscated by a hash which can be easily rainbow-tabled. Personally, I think it'd be fine to add a pepper to the hashed events, and require at the application level that for labels to work in encrypted rooms, the new user must be brought up to speed on the pepper (e.g. by the inviter sharing the pepper in an encrypted message, possibly to-device, after having invited them). This is simpler than using opaque IDs for the unencrypted event headers, as there's only one pepper that needs to be shared to new users, rather than the whole set of opaque->real label mappings. |
||
|
||
Instead of using hashes to identify labels in encrypted messages, using random | ||
opaque strings was also considered. Bearing in mind that we need to be able to | ||
use the label identifiers to filter the history of the room server-side (because | ||
we're not expecting clients to know about the whole history of the room), this | ||
solution had the following downsides, all originating from the fact that nothing | ||
would prevent 1000 clients from using each a different identifier: | ||
|
||
* filtering would have serious performances issues in E2EE rooms, as the server | ||
would need to return all events it knows about which label identifier is any | ||
of the 1000 identifiers provided by the client, which is quite expensive to | ||
do. | ||
|
||
* it would be impossible for a filtered `/message` (or `/sync`) request to | ||
include every event matching the desired label because we can't expect a | ||
client to know about every identifier that has been used in the whole history | ||
of the room, or about the fact that another client might suddenly decide to | ||
use another identifier for the same label text, and include those identifiers | ||
in its filtered request. | ||
|
||
Another proposed solution would be to use peppered hashes, and to store the | ||
pepper in the encrypted event. However, this solution would have the same | ||
downsides as described above. | ||
|
||
## Security considerations | ||
|
||
The proposed solution for encrypted rooms, despite being the only one we could | ||
think of when writing this proposal that would make filtering possible while | ||
obscuring the labels to some level, isn't ideal as it still allows servers to | ||
figure out labels by computing [rainbow | ||
tables](https://en.wikipedia.org/wiki/Rainbow_table). | ||
|
||
Because of this, clients might want to limit the use of this feature in | ||
encrypted rooms, for example by enabling it with an opt-in option in the | ||
settings, or showing a warning message to the users. | ||
|
||
It is likely that this solution will be replaced as part of a future proposal | ||
once a more fitting solution is found. | ||
|
||
## Unstable prefix | ||
|
||
Unstable implementations should hook up `org.matrix.labels` rather than | ||
`m.labels`. When defining filters, they should also use `org.matrix.labels` and | ||
`org.matrix.not_labels` in the `EventFilter` object. | ||
|
||
Additionally, servers implementing this feature should advertise that they do so | ||
by exposing a `org.matrix.label_based_filtering` flag in the `unstable_features` | ||
part of the `/versions` response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Bubu
Agreed that this would be a handy feature, and that listing all labels in a room is a nice thing to have generally. I'm thinking of one possible way to do this, which would be to add an endpoint that exposes the list of labels the server knows have been used in the room (which should be fairly easy given the server will probably already store
(event_id, label)
tuples in its database for efficiency). For encrypted rooms, this would return a list of hashes (which is what the server considers as a list of labels for that room, since it doesn't know about the actual labels), and clients would then be able to resolve those hashes by calling/messages
with a filter containing the labels to resolve, and extracting the labels from the response (which contains events that the client should be able to decrypt). This would allow such a feature to work well without having to leak more metadata.wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
As for metadata concerns, these are still there, as we are working on a set of ~6 fixed labels for this usecase. But this is basically already covered in the "Security Considerations" section here.
Whether or not the actually usage of these tags for images/links/etc. will become optional in E2EE chats is not part of this MSC I believe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we resolve this thread?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do that once I've updated the MSC to describe this solution, which I haven't got time to do yet.