
Consider adding more docs on RethinkDB Proxy #962

Open
mlucy opened this issue Nov 24, 2015 · 23 comments
@mlucy
Member

mlucy commented Nov 24, 2015

I feel like I've been asked about it a lot in the last month, and as far as I can tell we only talk about it as an aside in the changefeed docs. It might be worth adding a page that talks more about the subject (I'm not quite sure where it would go in our existing scheme).

We should probably mention:

  • What a proxy node is.
  • Why they exist.
  • What the performance characteristics are, especially for changefeeds.
  • How to start one and what the network config needs to look like (specifically the fact that it behaves like a normal RethinkDB node and thus other nodes in the cluster will try to connect to it).

@chipotle, any thoughts on whether this is worthwhile and where it should go?

@chipotle
Contributor

The main documentation is the (short) section called Running a proxy node in the "Scaling, sharding and replication" document; that's also linked under the Proxy nodes heading in "Optimizing query performance," from the end of the "Command line options" document, and from Scaling considerations in the main Changefeed documentation.

That section is relatively short, and if there's stuff you think is missing, we can add it. (For instance, it doesn't address "what the network config needs to look like.") And if there are other places we need to link it from, we can. Unless there's a lot missing, I'm not sure this really needs to be its own document, but if you don't think it belongs under "Scaling," maybe there's a case for a better location.

@mlucy mlucy self-assigned this Nov 25, 2015
@mlucy mlucy added the doc label Nov 25, 2015
@mlucy
Member Author

mlucy commented Nov 25, 2015

OK, cool! I actually missed that while looking for docs somehow. I think once rethinkdb/rethinkdb#5138 is in, we should update that section with information on how to automatically start a proxy node (and probably also add some information about the network configuration while we're in there). Assigning to myself in the meantime.

@danielmewes
Member

@mlucy This is currently assigned to you. Are you still planning to write something up for this or should we reassign?

@danielmewes danielmewes assigned danielmewes and unassigned mlucy Mar 8, 2016
@danielmewes
Member

I'm going to go ahead and reassign this to myself in order to get some details together. @mlucy Please complain if you have started writing anything up for this.

@mlucy
Member Author

mlucy commented Mar 8, 2016

@danielmewes -- I never did anything on this. We should probably fix rethinkdb/rethinkdb#5138 while we're at this though.

@danielmewes
Member

@chipotle Here's a write-up of some of the things that we should probably cover:

What is a RethinkDB proxy?

A RethinkDB proxy is a RethinkDB server that doesn't store any persistent state, but
performs certain aspects of query processing locally.

A typical use case is running a RethinkDB proxy instance locally on each application
server (see figure TODO). We'll take a closer look at typical proxy setups below.

TODO: Maybe insert a figure like the following one that compares a setup without and with proxies.
Specifically it should illustrate the idea of having the proxies run on the application
servers, with the application connecting directly to the proxy instead of to the
individual database servers.

 | 1. Basic setup without proxies |
 ----------------------------------                                      
   ____________________                    ________________                                              
  |     App server 1   |                  |   DB server 1  |                                                  
  |  _______________   |            ----->|                |==                       
  | |  Application  |--|----------->|      ----------------  ||                         
  |  ---------------   |            |                        ||                     
  |                    |            |                        ||                    
   --------------------             |      ________________  ||                                       
                        Client conn.|     |   DB server 2  | || Cluster connections
   ____________________             |---->|                |=||                                                
  |     App server 2   |            |      ----------------  ||                                               
  |  _______________   |            |                        ||                          
  | |  Application  |--|----------->|      ________________  ||                 
  |  ---------------   |            |     |   DB server 3  | ||                     
  |                    |            ----->|                |==                     
   --------------------                    ----------------   


  | 2. Setup with proxies        |
  --------------------------------                                                    
   ____________________                    ________________                                              
  |     App server 1   |                  |   DB server 1  |                                                  
  |  _______________   |                  |                |==                       
  | |  Application  |  |                   ----------------  ||                         
  |  ---------------   |                                     ||                     
  |  _______|_______   |                                     ||                    
  | |     Proxy     |==|=====================================||                    
   --------------------                    ________________  ||                                       
                                          |   DB server 2  | || Cluster connections
   ____________________                   |                |=||                                                
  |     App server 2   |                   ----------------  ||                                               
  |  _______________   |                                     ||                          
  | |  Application  |  |                   ________________  ||                 
  |  ---------------   |                  |   DB server 3  | ||                     
  |  _______|_______   |                  |                |=||                    
  | |     Proxy     |==|=============      ----------------  ||     
   --------------------             \=========================                              

More precisely, a proxy:

  • Cannot become a replica for any tables
  • Does not participate in write or failover majorities (i.e. it cannot be used to
    enable auto-failover in smaller clusters)
  • Does not have a server name or any server tags
  • Does not show up in certain system tables, including the server_config table

However, a RethinkDB proxy still provides the following features:

  • It listens for client connections.
  • It joins the cluster and connects directly to all servers in it, so it can
    efficiently route read and write operations to the responsible replicas.
  • It parses, interprets and processes queries. While most work will typically be
    passed on to the nodes that host the table data, certain operations will be performed
    on the proxy itself (for example non-indexed orderBy, operations on arrays, etc.).
  • It manages changefeeds locally, allowing deduplication of change notifications within
    the cluster.

To start a RethinkDB proxy, you can run rethinkdb proxy -j <other server>. See TODO for
more details on this command.

When should I use a RethinkDB proxy?

Primary use cases include:

  • Scaling changefeeds.
  • Reducing latencies within the cluster.
  • Improving throughput of queries where the bottleneck lies in certain aspects of the
    query processing, such as query parsing or expensive array-based operations.

We will see in the next section how proxy servers can achieve these objectives.

How can a proxy improve performance?

With a proxy running locally on each application server, you can avoid an additional
network hop thanks to the proxy's routing logic. The proxy always knows which database
server holds the data for a particular request, and can route the request directly to
the responsible server. Compare that to a client connecting directly to a database
server: since the application usually doesn't know which database server has the data
needed for a particular query, the server handling the query will need to forward the
request to obtain the data. Two network hops are required in this scenario, instead
of just one with the local proxy.
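
The hop-count difference can be sketched as a toy model. The two-shard map, server names, and helper names below are made up for illustration; this is not RethinkDB's internal routing code.

```python
# Toy model of request routing with and without a local proxy.
# SHARD_MAP is a hypothetical two-shard table: keys a-m on server1, n-z on server2.

SHARD_MAP = {"a-m": "server1", "n-z": "server2"}

def responsible_server(key):
    """Return the server holding the shard for a given key."""
    return SHARD_MAP["a-m"] if key[0].lower() <= "m" else SHARD_MAP["n-z"]

def hops_without_proxy(key, entry_server):
    """A client connects to some arbitrary server; that server must forward
    the request if it doesn't hold the key itself."""
    hops = 1  # client -> entry server
    if entry_server != responsible_server(key):
        hops += 1  # entry server -> responsible server
    return hops

def hops_with_local_proxy(key):
    """A local proxy knows the shard map and routes directly to the
    responsible server; the client -> proxy hop stays on the local host."""
    return 1  # proxy -> responsible server
```

For a key living on server1, a client that happened to connect to server2 pays two network hops, while the local proxy always pays one.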

Since proxies also perform certain query processing steps themselves, they can help
scale those queries more easily. Adding a proxy to a cluster is often easier than
adding a full database server. The proxy not only handles decoding client requests
and encoding query responses, but also performs a number of calculations locally.
These specifically include many ReQL commands that operate on in-memory arrays,
as well as commands that work on aggregated data (e.g. orderBy without an index,
commands following an ungroup operation, etc.).

Changefeeds

In addition to regular queries, a proxy can also be a very powerful tool for scaling
changefeed-heavy applications.

A proxy will manage changefeeds locally, and reduce the overhead (RAM, CPU and network)
on the database servers. For any write operation to a table with active changefeeds,
a database server only sends a single network message to a given proxy. If for example
10,000 clients are listening through a proxy to changes on a particular table, the
database server(s) hosting the table will send a single network message to the proxy when
the table is changed. The proxy then takes care of forwarding the change message to all
10,000 clients locally.

This even works if the clients are listening to different selections on the table, such
as when using the query r.table('test').getAll(val, {index: "idx"}).changes() with
different values val for each changefeed. The proxy will receive one message from the
database servers for every write to the table 'test', and will check locally which
changefeeds are affected by this change.
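
The fan-out and local filtering described above can be sketched with a toy hub. The class and method names are made up for illustration and are not RethinkDB's actual changefeed machinery:

```python
class ProxyChangefeedHub:
    """Toy model of proxy-side changefeed handling: the database server
    sends one change message to the proxy per write, and the proxy checks
    each locally registered changefeed's predicate before forwarding."""

    def __init__(self):
        self.subscribers = []       # list of (predicate, inbox) pairs
        self.messages_from_db = 0   # network messages received from the DB

    def subscribe(self, predicate):
        """Register a changefeed with a filter predicate; return its inbox."""
        inbox = []
        self.subscribers.append((predicate, inbox))
        return inbox

    def on_change(self, change):
        """One message from the DB server, fanned out locally to matches."""
        self.messages_from_db += 1
        for predicate, inbox in self.subscribers:
            if predicate(change):
                inbox.append(change)

hub = ProxyChangefeedHub()
# Ten clients each watch a different value, like getAll(val).changes():
inboxes = [hub.subscribe(lambda c, v=v: c["idx"] == v) for v in range(10)]
hub.on_change({"idx": 3, "doc": "updated"})
```

After the write, the hub has received a single message from the database server, and only the one changefeed whose predicate matches sees the change.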

How do I run it?

A proxy has lower hardware requirements than a regular database server. In particular,
it doesn't require fast storage, and it uses less RAM because it doesn't need to
maintain a local data cache.

That being said, proxies still benefit from fast CPUs, and require enough RAM to process
your application's queries. 256 MB of available RAM is often enough for simple queries
and moderate query throughput. However, complex queries and/or a high number of concurrent
operations (including a high number of open changefeeds) might require additional
resources on the proxy server.

Note that in contrast to a regular client, which can connect to a single server, a proxy
server must be able to connect directly to all database servers on their intra-cluster
ports (port 29015 by default). If a proxy is unable to connect to all database servers,
some tables might become inaccessible through the proxy, and queries on those tables will fail.
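
As a quick sanity check that a prospective proxy host can reach every server's intra-cluster port, one could run something like the following. This is a generic TCP connectivity check; the helper name and host list are made up for illustration:

```python
import socket

def check_cluster_reachability(servers, timeout=1.0):
    """Return the subset of (host, port) pairs that accept a TCP connection.

    A proxy must be able to reach every database server on its
    intra-cluster port (29015 by default); any unreachable server makes
    some tables inaccessible through the proxy.
    """
    reachable = []
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append((host, port))
        except OSError:
            pass  # refused, timed out, or unresolvable
    return reachable
```

Comparing the returned list against the full server list before starting the proxy would surface firewall or routing problems early.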

If all requirements are met, you can run a proxy through the rethinkdb proxy command.
See TODO for details.

TODO: Mention/Describe rethinkdb/rethinkdb#5138 once available

TODO: Maybe describe a few "best practice" scenarios in more detail. E.g.
"proxy running on each app server" vs. "adding a central pool of proxies to a cluster".

@williamstein

This is fantastic!! Things to maybe expand on:

  • Emphasize that you can add/remove proxies at any time without any impact on cluster stability (unlike normal nodes).
  • Say something about the relationship with connection pools; for example, with SageMathCloud I have a (complicated?) connection pool system, where applications make lots of connections to the database, round-robin queries across them, destroy connections that are slow, etc. I used this before using proxy nodes. I switched to proxy nodes and I'm still using this connection pool, but maybe it is completely unnecessary?
  • Maybe say something about how potentially more data gets transferred over the network in some cases. Where it says "For any write operation to a table with active changefeeds,..." you give the optimal best case situation. But isn't there a worst case -- maybe the client is listening for a very specific change, and 99.99% of updates to the table don't trigger that; however, with a proxy node, all of those changes get sent to the proxy node. For my personal use case (everything on a local super fast free network inside Google Compute Engine), even this worst case is fine, since the network is so good. But it could matter for some multi-data center deployments.

@danielmewes
Member

Thanks for the feedback @williamstein . Very useful. We'll try to incorporate that.

I think connection pools are still useful, because the proxy is still going to be able to utilize multiple cores better with multiple client connections.

@hamiltop

What about multiple client connections enables better CPU usage in a proxy?

@danielmewes
Member

@hamiltop Yeah that's what I meant. :-)

@hamiltop

@danielmewes Sorry, I was asking a question. Why does that enable better CPU usage? What aspect of multiple client connections leads to more CPU usage?

@danielmewes
Member

Ah, sorry. Each incoming client connection is assigned to one CPU core randomly (or actually round robin I think). A lot of work for any query run through that connection is going to happen on that core.

So by using multiple connections and spreading queries across them, you can better utilize multiple CPU cores on the proxy.
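
A minimal sketch of such a round-robin pool follows. The connection objects here are placeholders; a real pool would hold driver connections and execute the query on the chosen one:

```python
import itertools

class RoundRobinPool:
    """Spread queries across several client connections so a proxy (which
    pins each incoming connection to one CPU core) can use all its cores."""

    def __init__(self, connections):
        self._cycle = itertools.cycle(connections)

    def run(self, query):
        conn = next(self._cycle)
        # A real pool would execute the query on conn here; we just
        # return the pairing to show the round-robin distribution.
        return conn, query
```

Each successive query lands on the next connection in the cycle, wrapping around after the last one.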

@hamiltop

Interesting. Is that true for normal cluster connections? (not just proxies)

@danielmewes
Member

This is true for normal servers as well.

However, on a normal server there are more tasks that do not depend on the core
handling the connection and that use their own model for multi-thread distribution.
So it's a little bit less relevant there, depending on the workload.


@danielmewes
Member

I think most of the information is here. Handing over to @chipotle .

@brucepom

brucepom commented Sep 9, 2016

It would also be great to see some recommendations on starting the proxy and restarting it if the process dies or the server is rebooted. Looking at rethinkdb/rethinkdb#5138, there's a suggestion that this can be done by editing the init script. Not being an expert on this, I wasn't confident enough to jump into /etc/init.d/rethinkdb and start messing around, so I ended up using Upstart instead of altering the init script. I wrote up some notes, as I couldn't find an explanation anywhere of how to do this. I'd welcome feedback on the approach; I'm certainly not experienced with this.

@danielmewes
Member

Thanks for sharing your notes on this @brucepom ( https://medium.com/@brucepomeroy/running-a-rethinkdb-proxy-on-ubuntu-68f8cd308b7b ).

While we should fix this more generally in the mid-term (rethinkdb/rethinkdb#5138), it might be nice to mention how to add the upstart script for the meantime in our docs. @chipotle do you think that's something we could incorporate?

@suru1432002

I set up a RethinkDB proxy on a separate node rather than running the proxy on the app server itself, so my app contacts the proxy node, which in turn fetches the data from the RethinkDB cluster.

Is there any way to figure out whether the query processing is actually happening on the proxy machine?

From the netstat command I see that my proxy node is connected to some unknown IP on port 28015, apart from the cluster nodes (an IP I didn't use or configure anywhere in the network).

@thomasmodeneis

Hi,
I wrote a little post about Running a RethinkDB Proxy as Daemon that could be helpful to someone ...

@bbar

bbar commented Oct 7, 2018

Here's my attempt to start RethinkDB as a proxy node using systemd in Ubuntu 16.04. Feel free to add to it...

  1. Install RethinkDB as usual, as outlined in the documentation. (Don't worry about copying the sample configuration file mentioned there.)

  2. Create a systemd unit file:

     $ vim /lib/systemd/system/rethinkdb-proxy-node.service

  3. Add the following to the file:

     [Unit]
     Description=RethinkDB proxy node

     [Service]
     User=rethinkdb
     Group=rethinkdb
     ExecStart=/usr/bin/rethinkdb proxy --join <SOME_IP_ADDRESS_IN_YOUR_CLUSTER>:29015 --log-file /var/log/rethinkdb/rethinkdb.log --initial-password auto
     KillMode=process
     PrivateTmp=true

     [Install]
     WantedBy=multi-user.target

  4. Create the log dir and set its permissions:

     $ sudo mkdir /var/log/rethinkdb
     $ sudo chown rethinkdb:rethinkdb /var/log/rethinkdb

  5. Enable run on startup:

     $ sudo systemctl enable rethinkdb-proxy-node.service

  6. Start the service:

     $ sudo systemctl start rethinkdb-proxy-node.service

Other useful things...

Check the status:

 $ sudo systemctl status rethinkdb-proxy-node.service

Tail the log:

 $ tail -f /var/log/rethinkdb/rethinkdb.log

@atris

atris commented Oct 8, 2018

@bbar Could you open a pull request for this?

@bbar

bbar commented Oct 9, 2018

@atris sure. In this file, right?

@atris

atris commented Oct 9, 2018 via email
