Have a "fix everything" button for E2EE / Dealing with E2EE breakage #20685
Comments
Adding on from some discussion in #-dev: the majority of E2EE self-healing is reactive to the situation at hand. It does not proactively analyse the latent state of the participants involved to discover the error and then fix it. This issue essentially calls for the latter to be a manual process, but the alternatives section recommends making it an automatic one. Furthermore, in situations where these desynchronisations/errors have caused irreparable damage, a fallback option (even if it requires human confirmation) should be offered to perform a one-time recovery of E2EE history for all participants involved.
If we know which things need to be flushed, we can just fix the bugs that have led to corrupt state, rather than create an embarrassing band-aid solution. This is what the Element cryptography team is currently doing, using analytics to measure the rate of encryption errors and drive them down. It is a long and complex process that requires a lot of attention to detail, but we are making concrete progress. If you have suggestions for exactly which data stores or user properties you would like to flush, that might be a more tractable request, but otherwise, I'm inclined to close this as the team will not action it in its current form.
An incomplete list could be;
The above are giant reset buttons, but possibly the following could be proactive/selective;
Additionally, the assertions and assumptions that E2EE makes about the current state of the world must somehow be exposed so that code can look at them, compare them (either with this button or another "weird state" trigger), and then take action to correct them, if possible. @novocaine if this issue in the state of "provide a giant button" isn't feasible, could a variant which says "be more proactive with E2EE problems" work better? (With the core request being to probe and fix E2EE inconsistency before it can lead to unrecoverable errors.) My main point with this issue is that past users do not have any possible way to fix their E2EE. Maybe booting up an Element version that automatically detects, analyses, and fixes that could help their problem, but I know it's further off than just providing them a button. What I assume the cryptography teams are doing is providing better happy paths to E2EE, but that does not help users who have already entered a FUBAR state.
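One way to picture "exposing the assertions so that code can look at them" is a small registry of named invariants evaluated against a snapshot of local E2EE state. This is only a minimal sketch: the snapshot fields (`device_id`, `known_devices`, `encrypted_rooms`, `outbound_sessions`) and both invariants are hypothetical and are not taken from any Element or Matrix SDK.

```python
from typing import Any, Callable, Dict, List

Snapshot = Dict[str, Any]

# Registry of named invariants. Every name and field here is hypothetical.
INVARIANTS: Dict[str, Callable[[Snapshot], bool]] = {}


def invariant(name: str):
    """Register a predicate that should hold for a healthy E2EE snapshot."""
    def register(fn: Callable[[Snapshot], bool]):
        INVARIANTS[name] = fn
        return fn
    return register


@invariant("own device appears in own device list")
def own_device_listed(s: Snapshot) -> bool:
    return s["device_id"] in s["known_devices"]


@invariant("every encrypted room has an outbound session")
def rooms_have_sessions(s: Snapshot) -> bool:
    return set(s["encrypted_rooms"]) <= set(s["outbound_sessions"])


def violated(snapshot: Snapshot) -> List[str]:
    """Return the names of all invariants that do not hold, in registration order."""
    return [name for name, holds in INVARIANTS.items() if not holds(snapshot)]


# Assumed example state: one encrypted room has no outbound session.
snapshot = {
    "device_id": "ABC",
    "known_devices": ["ABC", "DEF"],
    "encrypted_rooms": ["!a:example.org", "!b:example.org"],
    "outbound_sessions": ["!a:example.org"],
}
print(violated(snapshot))
```

A "weird state" trigger could then call `violated()` and hand each failing name to whatever corrective action exists for it, rather than each check being buried in the code path that happens to hit it.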
I think some of these state clears are interesting options but only if the work to implement them is cheap; otherwise we're better off spending the time to try to fix the root causes. I've raised this internally with the team.
Quick Answer: Our ultimate goal is to fix the root cause, but in the meantime we can't let users down. I have a live demo of it to showcase usage, and will see how to share it. Thanks for the issue, I'll review it in more detail and see how to use it to improve our ideas. thx
@BillCarsonFr thanks for the response, I think this is a good step forward, but one thing I'd like to remind you of is that, while these issues and associated features will increase user visibility into how the innards of E2EE work, they are only useful for users who have intricate knowledge of those innards. My point still stands about self-healing E2EE. If this is what you're addressing, then all is good, I'm just following an impression given by those issues. (Which btw, looks amazing for E2EE debug)
Yes, that's what we want to address ultimately. These issues are not yet bringing us there; we are iterating on it, seeing what is really useful, and will then package it better as a self-healing feature.
(Interestingly I missed #3553 when looking for issues, though it's related to this.)
@novocaine There's a ton of these issues, and the problem is that some of these E2EE errors might be due to servers (Synapse vs Dendrite vs etc) or even various clients. So it seems worthwhile for whatever the Matrix team controls to have the best diagnostics possible, ideally diagnostics that can self-heal or guide the user through some simple steps that will heal/fix issues. I commented on how the Element-Desktop app with an Unable to decrypt error gave very little useful info in the CTRL-SHIFT-I console log, see my post at: It's very difficult to attract people to use the Matrix system instead of other chat systems when they make an effort to try it and then "immediately" have problems with chat not working at all. Because we do have an "in-between" server, we should be able to have clients plus server self-triangulate such issues and ideally self-heal with some minor notifications or prompting to users.
This.
I'm very much in agreement with @novocaine: a "fix everything button" is not a good solution here. Why even make the user press the button if we can magically fix everything? We know how frustrating encryption problems are and we are working on fixing them, and improving visibility of why they happen. Putting on band-aids which will sometimes fix up problems after they happen is a distraction from actually making the system reliable in the first place.
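As a rough illustration of the "self-triangulate" idea, the sketch below compares the device list a client has cached for each user with the list a server reports, and flags the differences. The function name, the input shape (user ID mapped to a set of device IDs) and the result keys are all assumptions made for this example; how each view would actually be obtained is left out.

```python
from typing import Dict, Set

DeviceMap = Dict[str, Set[str]]  # user ID -> device IDs (assumed shape)


def diff_device_lists(client_view: DeviceMap,
                      server_view: DeviceMap) -> Dict[str, Dict[str, Set[str]]]:
    """Report, per user, devices the client is missing or still holds stale."""
    mismatches: Dict[str, Dict[str, Set[str]]] = {}
    for user in sorted(set(client_view) | set(server_view)):
        cached = client_view.get(user, set())
        actual = server_view.get(user, set())
        if cached != actual:
            mismatches[user] = {
                "missing_on_client": actual - cached,  # server knows, client does not
                "stale_on_client": cached - actual,    # client knows, server does not
            }
    return mismatches


# Assumed example: the client's cache for Alice is out of date, Bob's is fine.
client = {"@alice:example.org": {"DEV1", "DEV2"}, "@bob:example.org": {"DEVB"}}
server = {"@alice:example.org": {"DEV1", "DEV3"}, "@bob:example.org": {"DEVB"}}
print(diff_device_lists(client, server))
```

A non-empty result would be the kind of signal that could drive a quiet re-sync or a short prompt to the user, instead of a bare "Unable to decrypt" message.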
Your use case
What would you like to do?
Have a "Fix Everything" button under E2EE settings, that will analyse, detect issues, and repair problems with E2EE as best as it can, and list the remaining unfixable problems to the user.
Why would you like to do it?
Currently on Matrix, E2EE breakage is fairly frequent, and it is a severe issue when it happens. Because E2EE is often based on happy-path behaviour, and its security sensitivity does not allow much flexibility in fixing issues, problems that start in E2EE will often persist until the affected users perform "magical incantations" to force it to fix itself.
However, normal users will very quickly be turned off from Element/Matrix once they encounter these issues. Because the issues are hard to debug, and often derive from many states (on the servers involved, and on the users' devices), they don't magically fix themselves, even if the original cause is long resolved. These issues will very likely crop up in "sensitive" scenarios: as E2EE is the default in many DMs, these one-to-one conversations will be interrupted and disrupted, heightening any negative emotions the user might have towards the platform.
How would you like to achieve it?
So, as a universal band-aid, a button like this would probably flush a lot of E2EE state, try to re-establish sessions, signal to other devices and users to re-synchronise themselves, and generally try to identify as many issues as possible to fix. All of this should happen in a "soft" way: it should not clear a user's login or other data.
This would very likely require spec changes and/or a rethinking of some E2EE interactions. It would likely introduce security risks as trade-offs, so it has to be approached with caution.
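The flow described above (detect, repair what can be repaired, report the rest, and leave the login alone) could be sketched roughly as follows. Everything in it is hypothetical: the `Check` structure, the state keys and the two example checks are stand-ins for illustration, not real client internals.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

State = Dict[str, object]


@dataclass
class Check:
    """One hypothetical diagnostic with an optional soft repair."""
    name: str
    detect: Callable[[State], bool]                   # True if the problem is present
    repair: Optional[Callable[[State], None]] = None  # None: no automatic fix exists


@dataclass
class Report:
    fixed: List[str] = field(default_factory=list)
    unfixable: List[str] = field(default_factory=list)


def fix_everything(state: State, checks: List[Check]) -> Report:
    """Run every check, attempt its repair, and re-check to confirm the result."""
    report = Report()
    for check in checks:
        if not check.detect(state):
            continue
        if check.repair is not None:
            check.repair(state)
        # A problem only counts as fixed if detection no longer fires.
        (report.unfixable if check.detect(state) else report.fixed).append(check.name)
    return report


CHECKS = [
    Check("stale outbound sessions",
          detect=lambda s: bool(s["stale_outbound_sessions"]),
          repair=lambda s: s["stale_outbound_sessions"].clear()),
    Check("undecryptable messages with no key source",
          detect=lambda s: bool(s["missing_keys"])),
]

# Assumed example state; the login token must survive the "soft" repair.
state = {
    "access_token": "secret-token",
    "stale_outbound_sessions": ["!room:example.org"],
    "missing_keys": ["session-1"],
}
report = fix_everything(state, CHECKS)
print(report.fixed, report.unfixable)
```

Re-running `detect` after each repair is a deliberate choice here: it keeps the final list of unfixable problems honest, which is the part the user would actually see.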
Have you considered any alternatives?
This is only an alternative to the specific mechanism proposed here, but some further context;
E2EE in Matrix is currently not self-healing in certain conditions. There is only a happy path, and no real mechanism to reconcile devices/sessions that have strayed from that path. This is likely due to the security implications that such recovery mechanisms would have.
A "fix everything" button will help the immediate problem of users having no option to recover their E2EE-powered rooms sanely, but this doesn't fix the main problem;
Even if for the sake of security, (un)intentional disruptions can happen along many points of the pipeline that E2EE relies on, and recovering from this should both be a UX and security priority. E2EE cannot (again, even for the sake of security) model itself primarily along the lines of a "100% spec-compliant" server.
There are two kinds of E2EE breakage;
The latter is the problem here, and a solution could be to have Matrix E2EE differentiate between the two.
Additional context
Related issues
element-hq/element-meta#310, #2996, #18881, #12250, #11049, element-hq/element-meta#1563, #18639, #16163, #17578, element-hq/element-meta#1930, #14820, #12851, #20670, #15388, #17500, #17622, #14921, #16086, #20247, #18541, #9219, #7312, #16614, #16613, #16458, #13744, and element-hq/element-meta#1859 are related to this.
element-hq/element-meta#1565, #16184, element-hq/element-meta#1894, #5675, and #15416 are tangentially related to this.
#20005, #18505, #18443, #13575, #11094, #6879, and #9434 are related to this insofar as this "fix everything" process should detect and address them.
#13582 (comment) is an example of a server issue causing an inconsistent state.
#3868 is also an example of how inconsistent state can be introduced.
TL;DR: Windows Troubleshooter for E2EE, but actually reliable and helpful.
Also, E2EE should possibly become more reliable and robust than matrix can be at its worst times.