Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Add a dockerfile for running a set of Synapse worker processes #9162

Merged
merged 44 commits into from
Apr 14, 2021

Conversation

anoadragon453
Copy link
Member

@anoadragon453 anoadragon453 commented Jan 19, 2021

This PR adds a Dockerfile and some supporting files to the docker/ directory. The Dockerfile's intention is to spin up a container with:

  • A Synapse main process.
  • Any desired worker processes, defined by a SYNAPSE_WORKER_TYPES environment variable supplied at runtime.
  • A redis for worker communication.
  • A nginx for routing traffic.
  • A supervisord to start all worker processes and monitor them if any go down.

Note that this is not currently intended to be used in production. If you'd like to use Synapse workers with Docker, instead make use of the official image, with one worker per container. The purpose of this dockerfile is currently to allow testing Synapse in worker mode with the Complement test suite.

configure_workers_and_start.py is where most of the magic happens in this PR. It reads from environment variables (documented in the file) and creates all necessary config files for the processes. It is the entrypoint of the Dockerfile, and thus is run any time the docker container is spun up, recreating all config files in case you want to use a different set of workers. One can specify which workers they'd like to use by setting the SYNAPSE_WORKER_TYPES environment variable (as a comma-separated list of arbitrary worker names) or by setting it to * for all worker processes. We will be using the latter in CI.

Huge thanks to @MatMaul for helping get this all working 🎉 This PR is paired with its equivalent on the Complement side: matrix-org/complement#62.

Note, for the purpose of testing this PR before it's merged: You'll need to (re)build the base Synapse docker image for everything to work (matrixdotorg/synapse:latest). Then build the worker-based docker image on top (matrixdotorg/synapse:workers).

TODO:

  • Take off the /dev/null's from postgres and caddy. It can hide startup errors, and they seem to mostly be quiet after startup.

This is probably best reviewed with all commits at once.

@anoadragon453
Copy link
Member Author

anoadragon453 commented Jan 20, 2021

I've found that the federation_sender worker currently causes some tests (TestOutboundFederationSend) to fail. I'm currently investigating this.

@anoadragon453 anoadragon453 force-pushed the anoa/synapse_worker_docker branch from 4aae6f3 to c52ee68 Compare January 21, 2021 14:09
@anoadragon453
Copy link
Member Author

I've found that the federation_sender worker currently causes some tests (TestOutboundFederationSend) to fail. I'm currently investigating this.

Still not sure what's causing this. I propose we continue on with it disabled for now. @MatMaul will investigate further in the coming day. I've written up an issue for this as well: #9192

This was causing worker_listener blocks to get generated with
Nonetype resources, crashing things. This was only a problem for
worker types that don't have any configured resources.
@anoadragon453 anoadragon453 force-pushed the anoa/synapse_worker_docker branch from 1373a76 to e66f9e7 Compare January 21, 2021 16:41
@anoadragon453
Copy link
Member Author

Hm, and now media traffic isn't reaching the media worker... some more debugging needed.

@anoadragon453 anoadragon453 removed the request for review from a team January 22, 2021 14:44
anoadragon453 added a commit to matrix-org/complement that referenced this pull request Jan 23, 2021
MatMaul and I were able to figure out why the media repository wasn't
reachable. Turns out Complement was sending its traffic directly to
the main process instead of nginx. The main process doesn't have
the media resource available by default, so that tipped us off.

Combined with the changes in matrix-org/synapse#9162
outside requests to 8008 now actually get routed to workers
if necessary.

With this, the media tests now work - and we can switch on all
available worker types!
@anoadragon453
Copy link
Member Author

Hm, and now media traffic isn't reaching the media worker... some more debugging needed.

Turns out this was only a problem when using the Complement Dockerfile, and was a result of traffic being sent to the wrong port. Things have been fixed on the Complement side.

Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should restrict this to complement, I don't think its worth having a docker image that sort-of-but-not-really can be used to run workers.

This will also need to support running multiple workers as a given type, e.g. event_persisters, so that we can test sharding. If we're doing this as a complement specific docker image then I'd vote we just support "monolith" or "all the workers" modes.

(I think its probably to ask whether it would have been easier to make complement support docker compose?)

@@ -40,7 +40,7 @@ listeners:
compress: false
{% endif %}

- port: 8008
- port: {{ SYNAPSE_PORT or 8008 }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we making this configurable?

Copy link
Member Author

@anoadragon453 anoadragon453 Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So that the reverse proxy can listen on 8008 and route traffic instead. Complement will only contact the homeserver on 8008/8448 (and making it configurable on Complement's end has been rejected).

I can place a comment down if you'd like?

docker/Dockerfile-workers Outdated Show resolved Hide resolved
docker/configure_workers_and_start.py Outdated Show resolved Hide resolved
proxy_pass http://localhost:8080;
proxy_set_header X-Forwarded-For $remote_addr;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why nginx? We use haproxy everywhere else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nginx was the solution that was reached for first, back in the design doc, mostly due to its simple config format. I can replace it with HAProxy instead, though it'd be a bit of a pain.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like if we want to support more than just the basic load-balancing techniques, then we'll likely need to switch over to HAProxy.

I'll check how big of a task this will be, but at the moment I'm leaning towards just supporting round-robin LB for now with nginx, and then switch to HAProxy after the fact. Doing so shouldn't require any changes in Synapse/Complement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So nginx does actually support load-balancing by request IP, or by hashing something in the request: https://nginx.org/en/docs/http/ngx_http_upstream_module.html#hash

But actually thinking about it again I don't think we really want any load-balancing techniques other than round-robin for the Complement usecase. We certainly don't want to load-balance by IP, as all requests from our test infra would be coming from localhost.

In fact I think what we really want is consistency in load-balancing between test runs. And round-robin gives us exactly that.

Copy link
Member

@erikjohnston erikjohnston Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main reason to use haproxy rather than nginx is that haproxy is kinda what we "officially" support. It doesn't matter too much but its nice to be consistent

Copy link
Member

@erikjohnston erikjohnston left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should move some of the stuff specific to the worker docker to conf-workers/ instead? It's a bit confusing what is being used for what etc.

I also think this would have been a lot simpler if instead of making it all configurable we just had the majority of it as static config, but I'm also conscious that this has been open for ages now.

docker/README.md Outdated Show resolved Hide resolved
@erikjohnston erikjohnston requested a review from a team April 6, 2021 09:24
@erikjohnston
Copy link
Member

I'd quite like to get thoughts from other people here about whether this is good enough for now.

@anoadragon453
Copy link
Member Author

I wonder if we should move some of the stuff specific to the worker docker to conf-workers/ instead? It's a bit confusing what is being used for what etc.

Good idea.

I also think this would have been a lot simpler if instead of making it all configurable we just had the majority of it as static config, but I'm also conscious that this has been open for ages now.

I think generating config files is still the way to go, rather than including a separate config for each worker. The added complexity on top of that which allows for customising which workers are in use isn't too substantial I don't think.

Plus as mentioned earlier it can be handy to easily be able to specify your own setup of workers, such as when Kegan did for his project recently.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd quite like to get thoughts from other people here about whether this is good enough for now.

Generally, I think it's fine. I've mentioned a few nits below, but I think the main thing I'd like to see changed is to move the docs out of the main docker/README, both because it's a separate docker image, but also because it helps point out the fact that this probably isn't something you want to be using unless you know what you're doing.

docker/README.md Outdated Show resolved Hide resolved
docker/README.md Outdated Show resolved Hide resolved
docker/README.md Outdated Show resolved Hide resolved
docker/README.md Outdated
You can read about jemalloc by reading the Synapse [README](../README.md)

### Running all worker processes in a single container for testing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this should be a separate file. This file is copied to https://hub.docker.com/r/matrixdotorg/synapse, and it's not really relevant to the main matrixdotorg/synapse image. (indeed, it would be rather confusing for people reading it at https://hub.docker.com/r/matrixdotorg/synapse)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh indeed, I completely forgot that it ends up on Docker Hub. Good call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I'm struggling to come up with a good place to put it... Perhaps README-testing.md? Or maybe it should go in docs/dev/ or CONTRIBUTING instead?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go with a README-testing.md in the docker dir, since it's mostly describing how the docker image works. We can always add references to it from elsewhere (eg it would be good to stick a comment in the the dockerfile itself).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've now done this.

docker/README.md Outdated Show resolved Hide resolved
docker/README.md Outdated Show resolved Hide resolved
docker/conf/log.config Outdated Show resolved Hide resolved
docker/conf/log.config Outdated Show resolved Hide resolved
docker/configure_workers_and_start.py Outdated Show resolved Hide resolved
@anoadragon453 anoadragon453 requested a review from richvdh April 9, 2021 18:28
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm once you fix the nits below

docker/README.md Show resolved Hide resolved
docker/README.md Outdated Show resolved Hide resolved
docker/README-testing.md Outdated Show resolved Hide resolved
docker/README-testing.md Outdated Show resolved Hide resolved
docker/README-testing.md Outdated Show resolved Hide resolved
docker/README-testing.md Outdated Show resolved Hide resolved
docker/configure_workers_and_start.py Outdated Show resolved Hide resolved
@anoadragon453 anoadragon453 merged commit 7e460ec into develop Apr 14, 2021
@anoadragon453 anoadragon453 deleted the anoa/synapse_worker_docker branch April 14, 2021 12:54
anoadragon453 added a commit that referenced this pull request Apr 28, 2021
Synapse 1.33.0rc1 (2021-04-28)
==============================

Features
--------

- Update experimental support for [MSC3083](matrix-org/matrix-spec-proposals#3083): restricting room access via group membership. ([\#9800](#9800), [\#9814](#9814))
- Add experimental support for handling presence on a worker. ([\#9819](#9819), [\#9820](#9820), [\#9828](#9828), [\#9850](#9850))
- Return a new template when an user attempts to renew their account multiple times with the same token, stating that their account is set to expire. This replaces the invalid token template that would previously be shown in this case. This change concerns the optional account validity feature. ([\#9832](#9832))

Bugfixes
--------

- Fixes the OIDC SSO flow when using a `public_baseurl` value including a non-root URL path. ([\#9726](#9726))
- Fix thumbnail generation for some sites with non-standard content types. Contributed by @rkfg. ([\#9788](#9788))
- Add some sanity checks to identity server passed to 3PID bind/unbind endpoints. ([\#9802](#9802))
- Limit the size of HTTP responses read over federation. ([\#9833](#9833))
- Fix a bug which could cause Synapse to get stuck in a loop of resyncing device lists. ([\#9867](#9867))
- Fix a long-standing bug where errors from federation did not propagate to the client. ([\#9868](#9868))

Improved Documentation
----------------------

- Add a note to the docker docs mentioning that we mirror upstream's supported Docker platforms. ([\#9801](#9801))

Internal Changes
----------------

- Add a dockerfile for running Synapse in worker-mode under Complement. ([\#9162](#9162))
- Apply `pyupgrade` across the codebase. ([\#9786](#9786))
- Move some replication processing out of `generic_worker`. ([\#9796](#9796))
- Replace `HomeServer.get_config()` with inline references. ([\#9815](#9815))
- Rename some handlers and config modules to not duplicate the top-level module. ([\#9816](#9816))
- Fix a long-standing bug which caused `max_upload_size` to not be correctly enforced. ([\#9817](#9817))
- Reduce CPU usage of the user directory by reusing existing calculated room membership. ([\#9821](#9821))
- Small speed up for joining large remote rooms. ([\#9825](#9825))
- Introduce flake8-bugbear to the test suite and fix some of its lint violations. ([\#9838](#9838))
- Only store the raw data in the in-memory caches, rather than objects that include references to e.g. the data stores. ([\#9845](#9845))
- Limit length of accepted email addresses. ([\#9855](#9855))
- Remove redundant `synapse.types.Collection` type definition. ([\#9856](#9856))
- Handle recently added rate limits correctly when using `--no-rate-limit` with the demo scripts. ([\#9858](#9858))
- Disable invite rate-limiting by default when running the unit tests. ([\#9871](#9871))
- Pass a reactor into `SynapseSite` to make testing easier. ([\#9874](#9874))
- Make `DomainSpecificString` an `attrs` class. ([\#9875](#9875))
- Add type hints to `synapse.api.auth` and `synapse.api.auth_blocking` modules. ([\#9876](#9876))
- Remove redundant `_PushHTTPChannel` test class. ([\#9878](#9878))
- Remove backwards-compatibility code for Python versions < 3.6. ([\#9879](#9879))
- Small performance improvement around handling new local presence updates. ([\#9887](#9887))
anoadragon453 added a commit to matrix-org/complement that referenced this pull request May 21, 2021
…worker configuration (#105)

This PR updates Complement's Synapse worker-flavoured docker image in a few ways:

* Updates the base image name from `synapse:workers` to `synapse-workers`, as it's built from a separate Dockerfile from the base `matrixdotorg/synapse` image.
* Update the `SYNAPSE_WORKERS` env var name to `SYNAPSE_WORKER_TYPES`
* Switches to an explicit list of worker names. We dropped the `*` functionality after adding sharding capabilities. The worker configuration mirrors [that of sytest](https://github.com/matrix-org/sytest/blob/7d297956d6c5a177d2ea492ce6561b2e2500b291/lib/SyTest/Homeserver/Synapse.pm#L657-L1051).

These changes grew out of [this PR on Synapse](matrix-org/synapse#9162) which changed the base image.

I'd ideally like feedback from the Synapse team on the chosen worker configuration.
babolivier added a commit to matrix-org/synapse-dinsic that referenced this pull request Sep 1, 2021
Synapse 1.33.0 (2021-05-05)
===========================

Features
--------

- Build Debian packages for Ubuntu 21.04 (Hirsute Hippo). ([\#9909](matrix-org/synapse#9909))

Synapse 1.33.0rc2 (2021-04-29)
==============================

Bugfixes
--------

- Fix tight loop when handling presence replication when using workers. Introduced in v1.33.0rc1. ([\#9900](matrix-org/synapse#9900))

Synapse 1.33.0rc1 (2021-04-28)
==============================

Features
--------

- Update experimental support for [MSC3083](matrix-org/matrix-spec-proposals#3083): restricting room access via group membership. ([\#9800](matrix-org/synapse#9800), [\#9814](matrix-org/synapse#9814))
- Add experimental support for handling presence on a worker. ([\#9819](matrix-org/synapse#9819), [\#9820](matrix-org/synapse#9820), [\#9828](matrix-org/synapse#9828), [\#9850](matrix-org/synapse#9850))
- Return a new template when an user attempts to renew their account multiple times with the same token, stating that their account is set to expire. This replaces the invalid token template that would previously be shown in this case. This change concerns the optional account validity feature. ([\#9832](matrix-org/synapse#9832))

Bugfixes
--------

- Fixes the OIDC SSO flow when using a `public_baseurl` value including a non-root URL path. ([\#9726](matrix-org/synapse#9726))
- Fix thumbnail generation for some sites with non-standard content types. Contributed by @rkfg. ([\#9788](matrix-org/synapse#9788))
- Add some sanity checks to identity server passed to 3PID bind/unbind endpoints. ([\#9802](matrix-org/synapse#9802))
- Limit the size of HTTP responses read over federation. ([\#9833](matrix-org/synapse#9833))
- Fix a bug which could cause Synapse to get stuck in a loop of resyncing device lists. ([\#9867](matrix-org/synapse#9867))
- Fix a long-standing bug where errors from federation did not propagate to the client. ([\#9868](matrix-org/synapse#9868))

Improved Documentation
----------------------

- Add a note to the docker docs mentioning that we mirror upstream's supported Docker platforms. ([\#9801](matrix-org/synapse#9801))

Internal Changes
----------------

- Add a dockerfile for running Synapse in worker-mode under Complement. ([\#9162](matrix-org/synapse#9162))
- Apply `pyupgrade` across the codebase. ([\#9786](matrix-org/synapse#9786))
- Move some replication processing out of `generic_worker`. ([\#9796](matrix-org/synapse#9796))
- Replace `HomeServer.get_config()` with inline references. ([\#9815](matrix-org/synapse#9815))
- Rename some handlers and config modules to not duplicate the top-level module. ([\#9816](matrix-org/synapse#9816))
- Fix a long-standing bug which caused `max_upload_size` to not be correctly enforced. ([\#9817](matrix-org/synapse#9817))
- Reduce CPU usage of the user directory by reusing existing calculated room membership. ([\#9821](matrix-org/synapse#9821))
- Small speed up for joining large remote rooms. ([\#9825](matrix-org/synapse#9825))
- Introduce flake8-bugbear to the test suite and fix some of its lint violations. ([\#9838](matrix-org/synapse#9838))
- Only store the raw data in the in-memory caches, rather than objects that include references to e.g. the data stores. ([\#9845](matrix-org/synapse#9845))
- Limit length of accepted email addresses. ([\#9855](matrix-org/synapse#9855))
- Remove redundant `synapse.types.Collection` type definition. ([\#9856](matrix-org/synapse#9856))
- Handle recently added rate limits correctly when using `--no-rate-limit` with the demo scripts. ([\#9858](matrix-org/synapse#9858))
- Disable invite rate-limiting by default when running the unit tests. ([\#9871](matrix-org/synapse#9871))
- Pass a reactor into `SynapseSite` to make testing easier. ([\#9874](matrix-org/synapse#9874))
- Make `DomainSpecificString` an `attrs` class. ([\#9875](matrix-org/synapse#9875))
- Add type hints to `synapse.api.auth` and `synapse.api.auth_blocking` modules. ([\#9876](matrix-org/synapse#9876))
- Remove redundant `_PushHTTPChannel` test class. ([\#9878](matrix-org/synapse#9878))
- Remove backwards-compatibility code for Python versions < 3.6. ([\#9879](matrix-org/synapse#9879))
- Small performance improvement around handling new local presence updates. ([\#9887](matrix-org/synapse#9887))
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants