Fix assertions being thrown by the EventsStream update function #7337

richvdh · 2020-04-23T17:34:54Z

The problem here was that the EventsStream update function could return more rows than were requested of it, which then upset the db_query_to_update_function wrapper, which didn't know how to handle this. The confusion over whether the limit was a limit or a target was introduced in #7024, but I think there are much longer-standing bugs here in the limit logic.

So, part of the fix here is to make it clear that the limit passed into the update_function is a target rather than a hard limit, and to avoid using db_query_to_update_function, which is too simple for this use case. But to get it right, we need to pull get_ex_outlier_stream_rows out to a separate query (to make it easier to see if we have hit the limit), and apply more intelligence about how we truncate the results from the three queries.

Tests and an answer to the FIXME to follow.

there doesn't seem to be much point in passing this limit all around, since both sides agree it's meant to be 100.

richvdh · 2020-04-24T09:16:55Z

This is intended to fix #7308.

erikjohnston

There is quite a lot going on here, but I think I now understand what's happening: we're specifically fixing a bug where EventsStream.update_function was basically truncating the individual sources before merging rather than after merging, leading to potentially missing updates when the stream is limited.
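To make the failure mode described above concrete, here is a minimal sketch contrasting the two orderings. It is not Synapse code: the row shape `(stream_id, payload)`, both function names, and the sample data are assumptions made purely for illustration.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (stream_id, payload); hypothetical row shape


def truncate_then_merge(a: List[Row], b: List[Row], limit: int) -> List[Row]:
    """Problematic order: cut each source to `limit` rows, then merge."""
    return sorted(a[:limit] + b[:limit])


def merge_then_truncate(a: List[Row], b: List[Row], limit: int) -> List[Row]:
    """Safer order: merge by stream_id first, then cut at a stream_id boundary."""
    merged = sorted(a + b)
    if len(merged) <= limit:
        return merged
    # Keep every row whose stream_id is at or below that of the limit-th row,
    # so that no stream_id is split (the result may exceed `limit` rows).
    cutoff = merged[limit - 1][0]
    return [row for row in merged if row[0] <= cutoff]


events = [(1, "ev"), (2, "ev"), (3, "ev"), (4, "ev")]
state = [(1, "st"), (5, "st"), (6, "st"), (7, "st")]

buggy = truncate_then_merge(events, state, limit=2)
fixed = merge_then_truncate(events, state, limit=2)
```

In this example `buggy` ends at stream_id 5 but never contains events 3 and 4, so a consumer that advances its token to 5 would skip them. `fixed` stops at stream_id 1, and everything up to that point is present.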

It feels like this is only really fixing #7308 because we just so happen to now be avoiding the call path that includes the assertion, which is fine but...

This doesn't fix db_query_to_update_function to work with DB functions that may return more than the given limit (which may be a case of converting the len == check to a len >= one, if callers are indeed happy with update functions returning more rows than expected). I think there are probably more DB functions that work like this than just the event stream? If not, we should document that db_query_to_update_function expects the given function to treat the passed-in limit as a hard limit.

Separately, as per comment we need to figure out how to better handle the limiting of get_all_updated_current_state_deltas.

The above is more me trying to get my head around the changes here. I think these changes make the situation strictly better, so I don't propose you fix the above bugs in this PR, but they are things that need to be fixed.
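For readers unfamiliar with the wrapper under discussion, the following is a rough sketch of what a "hard limit" contract looks like. It is an assumption-laden illustration, not the actual db_query_to_update_function: the names `query_to_update_function` and `fake_query`, the row shape, and the in-memory table are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

Row = Tuple[int, str]  # (stream_id, payload); hypothetical row shape
QueryFn = Callable[[int, int, int], Awaitable[List[Row]]]


def query_to_update_function(query: QueryFn):
    """Wrap a DB-style query that is expected to treat `limit` as a hard limit."""

    async def update_function(from_token: int, upto_token: int, limit: int):
        rows = await query(from_token, upto_token, limit)
        # Contract check: the wrapped query must never exceed the limit.
        assert len(rows) <= limit, "query returned more rows than the hard limit"
        limited = bool(rows) and len(rows) == limit
        if limited:
            # The query may have stopped early, so only claim progress
            # up to the last row actually returned.
            upto_token = rows[-1][0]
        return rows, upto_token, limited

    return update_function


async def fake_query(from_token: int, upto_token: int, limit: int) -> List[Row]:
    # Hypothetical in-memory stand-in for a database query.
    table = [(i, f"row{i}") for i in range(1, 11)]
    matching = [r for r in table if from_token < r[0] <= upto_token]
    return matching[:limit]


rows, token, limited = asyncio.run(query_to_update_function(fake_query)(0, 10, 3))
```

Relaxing the `==` comparison to `>=` and dropping the assertion would accept queries that overshoot, at the cost of no longer detecting a query that misbehaves at the limit.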

erikjohnston · 2020-04-24T09:35:50Z

synapse/replication/tcp/streams/events.py

+        # if we hit the limit on event_updates, there's no point in going beyond the
+        # last stream_id in the batch for the other sources.
+
+        assert len(event_rows) <= target_row_count


I think it's worth adding a comment that it's safe to do this because get_all_new_forward_event_rows will honour the limit. TBH it's not clear to me that these assertions help any; the limited checks could instead be replaced with len(event_rows) >= target_row_count.

> TBH it's not clear to me that these assertions help any; the limited checks could instead be replaced with len(event_rows) >= target_row_count.

I disagree. The only reason that it's safe to do upper_limit = event_rows[-1][0] is because we understand exactly how get_all_new_forward_event_rows behaves at the limit. If it starts returning more rows than the limit, then clearly we're not understanding its behaviour.

I think it's exactly that not-understanding that has allowed the bugs in this code to creep in and linger for so long, and I think exactly the same applies to the assertion in db_query_to_update_function: I wouldn't want to remove the assertion there.

I'll add a comment to explain this a bit.

erikjohnston · 2020-04-24T09:37:17Z

synapse/storage/data_stores/main/events_worker.py

@@ -973,8 +973,18 @@ def get_current_events_token(self):
        return self._stream_id_gen.get_current_token()

    def get_all_new_forward_event_rows(self, last_id, current_id, limit):
-        if last_id == current_id:
-            return defer.succeed([])


Why has this been removed?

It's redundant: there is another check for the same thing up in get_updates_since, and if in the future we do manage to call get_all_new_forward_event_rows with last_id == current_id, it still behaves correctly.

I'm not a fan of shortcuts like this - they make me suspicious that the algorithm they are short-cutting is incorrect somehow. Obviously if there's a significant efficiency gain from the shortcut, there's an argument for it, but since in practice the shortcut is there in get_updates_since, I don't think it has any merit here.

Three thoughts here:

I'd rather functions did easy wins in terms of performance to avoid foot guns when changes are made 10 layers up the stack, especially when pretty much all other functions that fetch streams from the DB do this and other optimisations.

Avoiding a DB query is a big win for functions that get called in a loop.

Personally I don't really feel that the presence of a short circuit makes me any more or less inclined to trust the algorithm is right.
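The trade-off being debated can be sketched as follows. This is an illustration only, using a hypothetical in-memory table and function names rather than the real storage layer: both variants return the same result for an empty range, and the short circuit only saves the query.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (stream_id, payload); hypothetical row shape

# Hypothetical in-memory table standing in for the events table.
TABLE: List[Row] = [(1, "a"), (2, "b"), (3, "c")]
QUERY_COUNT = 0  # counts how many times the "database" is hit


def fetch_rows(last_id: int, current_id: int, limit: int) -> List[Row]:
    """Range query: rows with last_id < stream_id <= current_id."""
    global QUERY_COUNT
    QUERY_COUNT += 1
    matching = [r for r in TABLE if last_id < r[0] <= current_id]
    return matching[:limit]


def fetch_rows_short_circuit(last_id: int, current_id: int, limit: int) -> List[Row]:
    """Same query, but skip the database when the range is empty."""
    if last_id == current_id:
        return []
    return fetch_rows(last_id, current_id, limit)


# Both give an empty list for an empty range; only the first touches the table.
same = fetch_rows(2, 2, 10) == fetch_rows_short_circuit(2, 2, 10)
```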

erikjohnston · 2020-04-24T10:08:08Z

synapse/replication/tcp/streams/events.py

+        if len(state_rows) == target_row_count:
+            assert state_rows[-1][0] <= upper_limit
+            upper_limit = state_rows[-1][0]
+            limited = True


I don't think this is safe, either:

get_all_updated_current_state_deltas takes a hard limit and may not return all rows with the last stream_id, or

get_all_updated_current_state_deltas always returns all rows for the last stream_id, and so may return more than limit items.

Currently, it looks like we're in the first case, and so we need to discard all rows with the stream_id of state_rows[-1][0].

Right. I think this is the same thing as I was worrying about for the FIXME, below.
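One way to handle the first case above (a hard limit that may split a stream_id) is to drop the trailing rows that share the last stream_id whenever the limit is hit. The helper below is a sketch under that assumption; its name and the row shape are hypothetical, and it is not what the PR implements.

```python
from typing import List, Tuple

Row = Tuple[int, str]  # (stream_id, payload); hypothetical row shape


def drop_incomplete_tail(rows: List[Row], limit: int) -> Tuple[List[Row], bool]:
    """Return (rows, limited), dropping a possibly incomplete final stream_id.

    Assumes `rows` is sorted by stream_id and was fetched with a hard `limit`.
    """
    if len(rows) < limit:
        return rows, False  # the query ran to completion
    last_stream_id = rows[-1][0]
    # Rows for last_stream_id may be incomplete, so hold them all back.
    kept = [row for row in rows if row[0] < last_stream_id]
    return kept, True


kept, limited = drop_incomplete_tail([(1, "a"), (2, "b"), (2, "c")], limit=3)
```

Note the edge case: if every returned row shares one stream_id, nothing is kept and no progress is made, so a real implementation would need a fallback such as re-querying that stream_id without a limit.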

erikjohnston · 2020-04-24T10:10:00Z

synapse/replication/tcp/streams/events.py

@@ -174,6 +174,10 @@ async def _update_function(
            upper_limit = state_rows[-1][0]
            limited = True

+            # FIXME: is it a given that there is only one row per stream_id in the
+            # state_deltas table (so that we can be sure that we have got all of the
+            # rows for upper_limit)?


No it isn't a given.

i.e. they can have multiple rows per stream ID afaict

Yeah, I suspected as much. I was rather hoping that this wouldn't be any worse than it has been for years, and that it would therefore be safe to deploy while I thought about it a bit harder.

erikjohnston · 2020-04-24T10:33:22Z

synapse/replication/tcp/streams/events.py

+        # "stream token" which is supposed to indicate how far we have got through
+        # all three streams. It's therefore no good to return rows 1-1000 from the
+        # "new events" table if the state_deltas are limited to rows 1-100 by the
+        # target_row_count.


It took me a while to realise the crux here is that: "for each source we must return all rows up to the same token."

Fair. Will add another comment.
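The rule "for each source we must return all rows up to the same token" can be sketched like this. The function name, signature, and data are assumptions for illustration; the sketch also assumes each source was queried with a hard limit of `target_row_count` and that the limit never splits a stream_id.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, str]  # (stream_id, payload); hypothetical row shape


def merge_to_common_token(
    sources: Sequence[List[Row]], current_token: int, target_row_count: int
) -> Tuple[List[Row], int, bool]:
    """Merge sorted sources, returning rows only up to a token all sources cover."""
    upper_limit = current_token
    limited = False
    for rows in sources:
        if len(rows) >= target_row_count:
            # This source may have more rows beyond its last stream_id, so
            # nothing past that point can safely be returned from any source.
            upper_limit = min(upper_limit, rows[-1][0])
            limited = True
    merged = sorted(row for rows in sources for row in rows if row[0] <= upper_limit)
    return merged, upper_limit, limited


events = [(1, "e"), (2, "e"), (3, "e")]  # hit the limit of 3
state = [(2, "s"), (5, "s")]             # did not hit the limit
merged, token, limited = merge_to_common_token([events, state], 10, 3)
```

Here the state row at stream_id 5 is held back, because the events source is only known to be complete up to stream_id 3.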

richvdh · 2020-04-24T11:29:06Z

> This doesn't fix db_query_to_update_function to work with DB functions that may return more than the given limit

As above: I'm not convinced that just lifting the restriction is the right solution - I think the fact that we were lax on these checks is what allowed us to mess up here previously.

And yes, I haven't gone through the other db_query functions to check whether they honour the limit or not. I assumed that the others were simpler and therefore more likely to honour the limit, but I agree it's worth checking. Again though, I'd like to defer it for now.

erikjohnston

Given this is breaking stuff and the functional changes are fine, let's merge it. I do think we need to figure out where we're going with these changes, as right now I think it's turning into a bit of a mess with assertions everywhere, and that feels wrong. We should either be consistent with our API contracts (and maybe assert those) or move the logic based on asserted assumptions down into the DB functions.

erikjohnston · 2020-04-24T12:48:25Z

(I should also say that this is a net improvement, thank you 🙂 )

Figuring out how to correctly limit updates from this stream without dropping entries is far more complicated than just counting the number of rows being returned. We need to consider each query separately and, if any one query hits the limit, truncate the results from the others. I think this also fixes some potentially long-standing bugs where events or state changes could get missed if we hit the limit on either query.

Make sure that the AccountDataStream presents complete updates, in the right order. This is much the same fix as #7337 and #7358, but applied to a different stream.

Synapse 1.13.0 (2020-05-19), Bugfixes: Fix a bug where event updates might not be sent over replication to worker processes after the stream falls behind. (#7337, #7358)

richvdh added 2 commits April 23, 2020 15:44

Remove 'limit' param from get_repl_stream_updates API

23b2826

there doesn't seem to be much point in passing this limit all around, since both sides agree it's meant to be 100.

Make it clear that the limit for an update_function is a target

9cbdfb3

richvdh requested a review from a team April 24, 2020 09:17

erikjohnston suggested changes Apr 24, 2020

richvdh requested a review from erikjohnston April 24, 2020 11:40

erikjohnston approved these changes Apr 24, 2020

richvdh added 2 commits April 24, 2020 13:59

changelog

3655eaf

richvdh force-pushed the rav/fix_update_limit_assertion branch from 77cb8cb to 3655eaf Compare April 24, 2020 12:59

richvdh merged commit 69a1ac0 into develop Apr 24, 2020

richvdh deleted the rav/fix_update_limit_assertion branch April 24, 2020 13:00

This was referenced Apr 24, 2020

More AssertionErrors from replication_notifier #7340

Closed

Replication fails with AssertionError in replication_notifier #7308

Closed

babolivier mentioned this pull request Apr 25, 2020

Synchrotron stopped relaying new messages #7345

Closed

This was referenced Apr 27, 2020

Fix limit logic for EventsStream #7358

Merged

Fix limit logic for AccountDataStream #7384

Merged

richvdh added a commit that referenced this pull request May 15, 2020

Fix limit logic for AccountDataStream (#7384)

6c1f7c7

Make sure that the AccountDataStream presents complete updates, in the right order. This is much the same fix as #7337 and #7358, but applied to a different stream.
