-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Shared rooms API doesn't work if the worker handling the request doesn't update the user directory #10339
Comments
One possibility would be to shift the Assuming this option is set for all workers, then workers can both:
We could also sort of munge that today with the boolean option (True vs False vs |
I think the best course of action would be to simply note that this api needs to also be served on the same worker as the user directory one, and return |
The check on On the other side however, we have a choice between failing loudly (returning an error on the shared rooms API if |
If you were to guess, how out-of-date would the table be, if it's requested from another worker? Is the user directory worker configured to only flush every so often? What'd be a reason why the table be out of date? Database driver caching? |
@ShadowJonathan I don't think it's problematic to have another process handle shared room API calls while another worker is updating the user directory. The user directory also updates the database immediately, rather than routinely flushing results, so there shouldn't be much delay if any. When I said the table would be out of date, I meant in the case that there was no worker process at all updating the user directory. |
So the problem is essentially that we need to be sure that the user directory is constantly being updated while this API is available to be queried from? I guess that is indeed something i'd like to see assured as well before we expose this. In that case, I definitely like the |
It's not quite that simple unfortunately. Not all workers need to have HTTP listeners configured, so there is no standard way to ping them currently. I don't think we need to come up with too technical of a solution though. I would be in favour of switching to a If you've forgotten to start a worker which you've disabled configuration elsewhere for, then that'll cause problems far beyond this endpoint. It may be worth making that situation less idiot-proof, but it's outside the scope of this issue. |
Wouldn't it be easy to map |
Aha, yes, technically speaking that would work out. Though that's something we've not yet done before, so we should be cautious before choosing that solution. As a devil's advocate, I can think of two potential reasons to be wary:
Something that may solve both of these problems is to give the main process a "name" that could be put in place of None/null here. |
I think a prerequisite to siding more fully with workers is to develop some comprehensive solution to the two main problems ive seen floating around;
Anyways, those two problems are separate to this, i think the best course of action then is to introduce How's that sound? |
Hmm, actually, now that I've had some time to think about it I go back on what I said. Canonicalising a name for the main process only further cements it as being a single worker. And I don't think there's really a valid reason to disable updating the user directory, especially if we start relying on its functionality for other endpoints... So I'm actually now in favour of your original suggestion in #10339 (comment), where |
Alright then, the next course of action seems to be to;
|
@ShadowJonathan That sounds sensible to me. It's acknowledged that if this value is set to something other than a running worker then the user directory will not be updated and that the shared rooms API may become out of date. |
👍 I’ll be sure to add a config documentation note about that as well. |
wait, what endpoint are we actually complaining about here? MSC2666 is |
Ok, I think, in spite of what the MSC says, we're talking about |
(The endpoint was changed in the MSC long after the Synapse impl, and this issue was created) |
(a) then the synapse impl needs updating, no? I've now updated the title and description. I'm annoyed at having to waste time on this. |
I remain confused about the problem and the proposed solution here. AIUI, the current situation is that the user directory must be updated and searched on a single worker process. "Searching" means the current documented endpoint This issue proposes to relax that requirement for Please could somebody explain to me why they believe that is safe? The guard must have been there for a reason; if it no longer applies, what has changed? And if it is safe, does this also apply to |
I was planning to do that after #11450, I wanted that to finish first before I touched that change.
Nothing wrt the code, the idea behind this config value was that it'd force the admin to take a closer look at what they'd put there, while defaulting the value to the main process. The idea is that, if it is at least running somewhere, and because there is higher visibility as to where it is running from the config (because you have to specify the specific worker's name), there's a higher chance that the user directory updater is running somewhere in the worker cluster. #11450 tries to encode this in the config, as it defaults the user directory updater to the main process (which we'll assume is always running), and makes any change have to have the admin point it to a specific worker, which has a slightly better chance of being corrected if the worker doesnt exist anymore, if an admin were to glance it. (Compared to a This would allow the endpoint to run on any client-server servlet, though I see this isn't marginally any more safe. The real safe way to do this would be to introduce application-level worker health awareness in synapse, but i'm not sure how manageable that is. Frankly, this is the issue that was pointed to that was blocking everything, and I wanted to put some progress in this feature. |
TL;DR this has gotten complicated and I'm no longer convinced that the proposed solution actually improves the situation over the config we have today. Shall we just remove the check instead? In summary: MSC2666 introduced an endpoint for clients to discover what rooms the user shares with any other given user. The endpoint has changed shape in December, but was initially implemented in Synapse as The handler for this endpoint uses the Because of this, the original implementation included a check for whether The solution initially proposed above was to deprecate the A bit later, in #10339 (comment) a few comments down, it was proposed for Honestly, after having read everything over I think we've essentially arrived at where we started. The solution of And at this point I feel that's something that it shouldn't care about. If we just remove the check, we gain the ability to run this endpoint across different workers, without the need for deprecations. If the user directory isn't running on a configured worker - which is not the default behaviour - then that was the choice of the sysadmin; and the docs should warn about that implication. An option like Sorry for the massive run-around. |
Thanks for the summary, this is helpful with knowing where this issue stands. I agree with the above points, and I'd wanna make a PR which simply removes the check. I think at this point #11450 has run its course, and at this point it's no use to push it forward anymore, so I'm thinking of closing it on that note (if everyone agrees with this). Instead, to resolve this issue, I'll open a simple PR which removes the check and adds an extra note/warning to the How's that sound? |
Sounds good to me. |
Could you be explicit as to whether this is a correct understanding? https://github.com/matrix-org/synapse/blob/develop/docs/workers.md#synapseappuser_dir says that (OTOH, it would be far from the only place that the workers doc is misleading/wrong.) |
From a quick look at the code, it doesnt seem like I imagine it was written this way to make sure that there's a running user directory updater somewhere (either on main or on the worker), as to make things more fool-proof. The only real thing these endpoints ( Imo, the documentation should mention that the preferred way to do this is to make sure those paths are handled by the dedicated worker, but that it can be handled by any other (if the admin knows what they're doing). (Though the code confusingly sets the config |
right, understood. Please could you open an issue noting the incorrect documentation? |
Ignoring worker types, I'm not sure how one worker having I think we can also deprecate and remove the Edit: Ah, so the |
Actually, this will not work, as the config script will force the config option off for all other kinds of workers; synapse/synapse/app/generic_worker.py Lines 461 to 475 in 4ae956c
Unless we remove this bit, ofc, but we'd need a migration path for that, as it could then lead to all processes simultaneously believing they should update the user directory. In any case, i'm currently interested in resolving this issue, so i'm still intending to submit a PR removing this check, adding extra documentation to |
If we remove the check on |
#7785 added
^/_matrix/client/unstable/uk.half-shot.msc2666/user/shared_rooms/.*$
.We check the
update_user_directory
config flag when handling this API, to basically sanity check that the tables we're about to query are roughly up-to-date (users_who_share_private_rooms
,users_in_public_rooms
). However, if the request is handled on a different synapse worker which does not have that config flag set, then the request is wrongly refused.I'm not sure if there is an easy way to check if another Synapse process is configured to update the user directory, or whether we should just remove the check. Another option would just be to ensure this endpoint is handled on the same worker.
The text was updated successfully, but these errors were encountered: