diff --git a/proposals/2162-signaling-errors-at-bridges.md b/proposals/2162-signaling-errors-at-bridges.md
new file mode 100644
index 00000000000..9414f9b2eb6
--- /dev/null
+++ b/proposals/2162-signaling-errors-at-bridges.md
@@ -0,0 +1,312 @@
+# Signaling Errors at Bridges
+
+Sometimes bridges just silently swallow messages and other events. This proposal
+enables bridges to communicate that something went wrong and gives clients the
+option to give feedback to their users. Clients are given the possibility to
+retry a failed event and bridges can signal the success of the retry.
+
+## Proposal
+
+Bridges might come into a situation where there is nothing more they can do to
+successfully deliver an event to the foreign network they are connected to. Then
+they should be able to inform the originating room of the event about this
+delivery error. The user in turn should be able to instruct the bridge to retry
+sending the message that was presented to them as failed; the bridge should have
+the ability to mark an error as being revoked.
+
+If [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) is utilized for
+this proposal it would additionally give the benefits of
+
+- trimming the number of properties required in each bridge error event by
+  providing this general information about the bridge in the room state instead.
+- not requiring users representing the bridge to have admin power levels
+  (see [Rights management](#rights-management)).
+
+### Bridge error event
+
+This document proposes the addition of a new room event with type
+`m.bridge_error`. It is sent by the bridge and references an event previously
+sent in the same room, thereby marking the original event as “failed to deliver”
+for all users of a bridge.
+The new event type utilizes reference aggregations
+([MSC
+1849](https://github.com/matrix-org/matrix-doc/blob/matthew/msc1849/proposals/1849-aggregations.md#relation-types))
+to establish the relation to the event whose delivery it marks as failed.
+There is no need for a new endpoint as the existing `/send` endpoint will be
+utilized.
+
+The event additionally contains the name of the bridged
+network (e.g. “Discord” or “Telegram”) and a regex array¹ describing the
+affected users (e.g. `@discord_.*:example.org`). This regex array should be
+similar to the one any Application Service uses for marking its reserved user
+namespace. With this information, clients can tell their users who in
+the room was affected by the error and for which network the error occurred.
+
+*Those two fields will not be required if the variant with [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) is adopted. In
+this case the same information is stored alongside other bridge metadata in the
+room state.*
+
+There are some common reasons why an error occurred. These are encoded in the
+`reason` attribute, which can take the following values:
+
+* `m.event_not_handled` Generic error type for when an event cannot be handled
+  by the bridge. It is used as a fallback when there is no other more specific
+  reason.
+
+* `m.event_too_old` With enough time passed, a message falls out of its
+  original context. In this case the bridge might decide that the event is too
+  old and emit this error.
+
+* `m.foreign_network_error` The bridge was doing its job fine, but the foreign
+  network permanently refused to handle the event.
+
+* `m.unknown_event` The bridge is not able to handle events of this type. It is
+  totally legitimate to “handle” an event by doing nothing and not throwing this
+  error. It is at the discretion of the bridge author to find a good balance
+  between informing the user and preventing unnecessary spam.
Throwing this + error only for some subtypes of an event is fine. + +* `m.bridge_unavailable` The homeserver couldn't reach the bridge. + +* `m.no_permission` The bridge wanted to handle an event, but didn't have the + permission to do so. + +The bridge error can provide a `time_to_permanent` field. If this field is +present it gives the time in milliseconds one has to wait before declaring the +bridge error as permanent. As long as an error is younger than this time, the +client can expect the possibility of the error being revoked. If a bridge error +is permanent, it should not be revoked anymore. In case this field is missing, +the error will never be considered permanent. + +Notes: + +- Nothing prevents multiple bridge error events to relate to the same event. + This should be pretty common as a room can be bridged to more than one network + at a time. + +- A bridge might choose to handle bridge error events, but this should never + result in emitting a new bridge error as this could lead to an endless + recursion. + +The need for this proposal arises from a gap between the Matrix network and +other foreign networks it bridges to. Matrix with its eventual consistency is +unique in having a message delivery guarantee. Because of this property there is +no need in the Matrix network itself to model the failure of message delivery. +This need only arises for interactions with foreign networks where message +delivery might fail. This proposal extends Matrix to be aware of these error +cases. + +Additionally there might be some operational restrictions of bridges which might +make it necessary for them to refrain from handling an event, e.g. when hitting +memory limits. In this case the new event type can be used as well. 
+
+This is an example of how the new bridge error might look:
+
+```
+{
+    "type": "m.bridge_error",
+    "content": {
+        "network": "Discord",
+        "affected_users": ["@discord_.*:example.org"],
+        "reason": "m.bridge_unavailable",
+        "time_to_permanent": 900,
+        "m.relationship": {
+            "rel_type": "m.reference",
+            "event_id": "$some:event.id"
+        }
+    }
+}
+```
+
+\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\
+¹ Or similar; see [Security Considerations](#security-considerations)
+
+### Retries and error revocation
+
+Providing a way to retry a failed message delivery gives the sender control over
+the importance of their message. An extra procedure for a retry is necessary as
+the message might have been delivered to some users (those not on the bridge)
+and simply resending it would produce duplicate messages for them.
+
+A retry request is posted by the client to the room for all bridges to see it,
+referencing the original event. By inspecting the sender of all related
+`m.bridge_error` events, each bridge can find out whether it is the one
+responsible. The responsible bridge re-fetches the original event and tries
+to deliver it again.
+
+A successful retry should be communicated by revoking (not redacting) the
+original error that made the retry necessary. Revocation is done by an event
+with the type `m.bridge_error_revoke` which references the original event. The
+error(s) having a sender of the same bridge as the revocation event are
+considered revoked. Clients can show a revocation message e.g. as “Delivered to
+Discord at 14:52.” beside the original event.
+
+On an unsuccessful retry the bridge may edit the error's content to reflect the
+new state, e.g. because the type of error changed or to communicate the new
+time.
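
The revocation rule above (errors whose sender belongs to the same bridge as the revocation are considered revoked) could be evaluated by a client roughly as follows. This is a hedged Python sketch under two assumptions that the proposal does not state: "same bridge" is simplified to "same `sender`", and a revocation only affects errors that precede it by `origin_server_ts`. The function name `unrevoked_errors` is hypothetical.

```python
def unrevoked_errors(related_events):
    """Return the bridge errors on one original event that are still open.

    `related_events` are the events referencing the original event.
    Assumption: a bridge is identified by the `sender` of its events.
    """
    open_errors = {}
    for ev in sorted(related_events, key=lambda e: e["origin_server_ts"]):
        sender = ev["sender"]
        if ev["type"] == "m.bridge_error":
            open_errors.setdefault(sender, []).append(ev)
        elif ev["type"] == "m.bridge_error_revoke":
            # Revokes every earlier error sent by the same sender.
            open_errors.pop(sender, None)
        # Other related events (e.g. m.bridge_retry) are ignored here.
    return [err for errs in open_errors.values() for err in errs]
```

With rich bridging, the sender comparison would presumably be replaced by a lookup of the sender's bridge in the room state.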
+
+Example of the new retry events:
+
+```
+{
+    "type": "m.bridge_retry",
+    "content": {
+        "m.relationship": {
+            "rel_type": "m.reference",
+            "event_id": "$original:event.id"
+        }
+    }
+}
+```
+
+```
+{
+    "type": "m.bridge_error_revoke",
+    "content": {
+        "m.relationship": {
+            "rel_type": "m.reference",
+            "event_id": "$original:event.id"
+        }
+    }
+}
+```
+
+Overview of the relations between the different event types:
+
+```
+              m.references
+ ________________     _____________________
+|                |   |                     |
+| Original Event |-+-|    Bridge Error     |
+|________________| | |_____________________|
+                   |  _____________________
+                   | |                     |
+                   +-|    Retry Request    |
+                   | |_____________________|
+                   |  _____________________
+                   | |                     |
+                   +-| Bridge Error Revoke |
+                     |_____________________|
+```
+
+A retry might not make much sense for every kind of error, e.g. retrying
+`m.unknown_event` will probably result in the same error again. Clients may
+choose to disable retry options for those cases, but it is not restricted
+otherwise.
+
+### Special case: Unavailable bridge
+
+In the case the bridge is down or otherwise disconnected from the homeserver, it
+naturally has no way to inform its users about the unavailability. In this case
+the homeserver can stand in as an agent for the bridge and answer requests in
+its absence.
+
+For this to happen, the homeserver will send out a bridge error event at the
+moment a transaction delivery to the bridge fails. The clients at this point
+will start showing an error. When the bridge comes back online it will encounter
+a higher-than-normal load as all events accumulated over the downtime are
+flooding in. To handle this scenario well, the bridge will want to simply
+discard all messages older than a given threshold and not bother with sending
+any answer back.
+
+By including a timeout in the `time_to_permanent` field of the event, the client
+will know without further feedback from the homeserver or bridge when the
+message won't be delivered anymore.
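
The catch-up behavior described above can be sketched on the bridge side. This is a minimal Python illustration, not a prescribed implementation: the function name `partition_backlog` and the parameter `max_age_ms` are hypothetical, and using `origin_server_ts` to compute an event's age is an assumption. A bridge would plausibly choose `max_age_ms` to match the `time_to_permanent` value the homeserver advertised, though the proposal does not require this.

```python
def partition_backlog(events, now_ms, max_age_ms):
    """Split the events queued during downtime into (to_deliver, to_discard).

    Events older than `max_age_ms` are dropped silently; clients learn about
    the failure from `time_to_permanent` on the homeserver's bridge error.
    """
    to_deliver, to_discard = [], []
    for ev in events:
        age_ms = now_ms - ev["origin_server_ts"]
        if age_ms > max_age_ms:
            to_discard.append(ev)
        else:
            to_deliver.append(ev)
    return to_deliver, to_discard
```

Only the events in the first list would then be bridged and have their errors revoked.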
+
+For those events still accepted by the bridge, the error must be revoked by a
+`m.bridge_error_revoke` as described in the previous section.
+
+**Note:** For this to work, the homeserver is required to impersonate a user of
+the bridge as it has no agent of its own. The impersonated user would be the
+bridge bot user or one of the virtual users in the bridge's namespace.
+
+### Rights management
+
+Only bridges should be allowed to send bridge errors and revocations.
+
+Utilizing the rights system of the room provides a good approximation to this
+behavior. It is fine to use it under the assumptions that
+
+- `m.bridge_error` and `m.bridge_error_revoke` require admin power levels.
+- there is always the bridge bot user or a virtual user in the bridge's
+  namespace present in the room.
+- at least one of those users possesses admin power level.
+- all users with admin power levels are trusted.
+
+In short, this requires giving bridges admin power levels in a room and trusting
+them to restrict their actions to their own business. It is enough to have one
+privileged bridge user in the room. In public rooms this is most commonly the
+bridge bot user, which has admin power level, and in 1:1 conversations it is
+the puppeted conversation partner, which generally has admin power levels
+as well.
+
+As long as the above assumptions are met, it is fine to not explicitly denote
+bridges and bridge users as such and simply rely on the power levels for access
+control to the new events.
+
+An alternative to the above solution is the adoption of [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410). It stores
+information about users' affiliation to a bridge in the room state. Instead of
+checking power levels of users, rich bridging can be utilized by checking the
+room state and only allowing valid representatives of the bridge to send bridge
+errors and their revocations.
+This alternative has the advantage of not
+requiring agents of the bridge to be powerful. They would be verifiable and
+could be trusted without any restrictions regarding their power levels.
+
+## Tradeoffs
+
+Without this proposal, bridges could still inform users in a room that a
+delivery failed by simply sending a plain message event from a bot account. This
+possibility carries the disadvantage of conveying no special semantic meaning
+with the consequence of clients not being able to adapt their presentation.
+
+A fixed set of error types might be too restrictive to express every possible
+condition. An alternative would be a free-form text for an error message. This
+brings the problems of less semantic meaning and a requirement for
+internationalization with it. In this proposal a generic error type is provided
+for error cases not considered in this MSC.
+
+The nature of a retry request from a client to the bridge lends itself more to
+an ephemeral type of transport than something permanent like a PDU, but this was
+advised against because The Spec doesn't make implementations of new EDU types
+easy. Application Services in general can't listen to EDUs, so further
+changes to The Spec would be necessary before following the probably more
+appropriate route here.
+
+A new event type `m.bridge_error_revoke` is introduced for revoking a bridge
+error. Alternatively it could be considered to redact the bridge error event,
+which would eliminate the need for the revocation event and would make this
+proposal a little simpler. The disadvantage of this approach is the missing
+transparency and context of who had which information at which point in time.
+This additional information should make for a better user experience.
+
+## Potential issues
+
+When the foreign network is not the cause of the error signaled but the bridge
+itself (maybe under load), there might be an argument that responding to failed
+messages increases the pressure.
+
+## Security considerations
+
+Sending a custom regex with an event might open the doors for attacking a
+homeserver and/or a client by exposing a direct pathway to the complex code of a
+regex parser. Additionally sending arbitrarily complex regexes might make Matrix
+more vulnerable to DoS attacks. To mitigate these risks it might be sensible to
+only allow a more restricted subset of regular expressions by e.g. requiring a
+maximum length or falling back to simple globbing.
+
+When utilizing power levels instead of building on [MSC 1410: Rich
+Bridging](https://github.com/matrix-org/matrix-doc/issues/1410) a malicious user
+who has enough power to send `m.bridge_error` or `m.bridge_error_revoke` is able
+to impersonate a bridge. Such a user can wrongly mark messages as failed to
+deliver, or revoke errors that were not successfully retried.
+
+## Conclusion
+
+In this document an event is proposed for bridges to signal errors and a way to
+retry and revoke those errors. The event informs the affected room about which
+message errored for which reason; it gives information about the affected users
+and the bridged network. By implementing the proposal Matrix users will get more
+insight into the state of their (un)delivered messages and thus they will become
+less frustrated.