
Consider adding more docs on RethinkDB Proxy #962

Open
mlucy opened this issue Nov 24, 2015 · 23 comments
@mlucy
Member

mlucy commented Nov 24, 2015

I feel like I've been asked about it a lot in the last month, and as far as I can tell we only talk about it as an aside in the changefeed docs. It might be worth adding a page that talks more about the subject (I'm not quite sure where it would go in our existing scheme).

We should probably mention:

  • What a proxy node is.
  • Why they exist.
  • What the performance characteristics are, especially for changefeeds.
  • How to start one and what the network config needs to look like (specifically the fact that it behaves like a normal RethinkDB node and thus other nodes in the cluster will try to connect to it).

@chipotle, any thoughts on whether this is worthwhile and where it should go?

@chipotle
Contributor

The main documentation is the (short) section called Running a proxy node in the "Scaling, sharding and replication" document; that's also linked under the Proxy nodes heading in "Optimizing query performance," from the end of the "Command line options" document, and from Scaling considerations in the main Changefeed documentation.

That section is relatively short, and if there's stuff you think is missing, we can add it. (For instance, it doesn't address "what the network config needs to look like.") And if there are other places we need to link it from, we can. Unless there's a lot missing, I'm not sure this really needs to be its own document, but if you don't think it belongs under "Scaling," maybe there's a case for a better location.

@mlucy mlucy self-assigned this Nov 25, 2015
@mlucy mlucy added the doc label Nov 25, 2015
@mlucy
Member Author

mlucy commented Nov 25, 2015

OK, cool! I actually missed that while looking for docs somehow. I think once rethinkdb/rethinkdb#5138 is in, we should update that section with information on how to automatically start a proxy node (and probably also add some information about the network configuration while we're in there). Assigning to myself in the meantime.

@danielmewes
Member

@mlucy This is currently assigned to you. Are you still planning to write something up for this or should we reassign?

@danielmewes danielmewes assigned danielmewes and unassigned mlucy Mar 8, 2016
@danielmewes
Member

I'm going to go ahead and reassign this to myself in order to get some details together. @mlucy Please complain if you have started writing anything up for this.

@mlucy
Member Author

mlucy commented Mar 8, 2016

@danielmewes -- I never did anything on this. We should probably fix rethinkdb/rethinkdb#5138 while we're at this though.

@danielmewes
Member

@chipotle Here's a write-up of some of the things that we should probably cover:

What is a RethinkDB proxy?

A RethinkDB proxy is a RethinkDB server that doesn't store any persistent state, but
performs certain aspects of query processing locally.

A typical use case is running a RethinkDB proxy instance locally on each application
server (see figure TODO). We'll take a closer look at typical proxy setups below.

TODO: Maybe insert a figure like the following one that compares a setup without and with proxies.
Specifically it should illustrate the idea of having the proxies run on the application
servers, with the application connecting directly to the proxy instead of to the
individual database servers.

 | 1. Basic setup without proxies |
 ----------------------------------                                      
   ____________________                    ________________                                              
  |     App server 1   |                  |   DB server 1  |                                                  
  |  _______________   |            ----->|                |==                       
  | |  Application  |--|----------->|      ----------------  ||                         
  |  ---------------   |            |                        ||                     
  |                    |            |                        ||                    
   --------------------             |      ________________  ||                                       
                        Client conn.|     |   DB server 2  | || Cluster connections
   ____________________             |---->|                |=||                                                
  |     App server 2   |            |      ----------------  ||                                               
  |  _______________   |            |                        ||                          
  | |  Application  |--|----------->|      ________________  ||                 
  |  ---------------   |            |     |   DB server 3  | ||                     
  |                    |            ----->|                |==                     
   --------------------                    ----------------   


  | 2. Setup with proxies        |
  --------------------------------                                                    
   ____________________                    ________________                                              
  |     App server 1   |                  |   DB server 1  |                                                  
  |  _______________   |                  |                |==                       
  | |  Application  |  |                   ----------------  ||                         
  |  ---------------   |                                     ||                     
  |  _______|_______   |                                     ||                    
  | |     Proxy     |==|=====================================||                    
   --------------------                    ________________  ||                                       
                                          |   DB server 2  | || Cluster connections
   ____________________                   |                |=||                                                
  |     App server 2   |                   ----------------  ||                                               
  |  _______________   |                                     ||                          
  | |  Application  |  |                   ________________  ||                 
  |  ---------------   |                  |   DB server 3  | ||                     
  |  _______|_______   |                  |                |=||                    
  | |     Proxy     |==|=============      ----------------  ||     
   --------------------             \=========================                              

More precisely, a proxy:

  • Cannot become a replica for any tables
  • Does not participate in write or failover majorities (i.e. it cannot be used to
    enable auto-failover in smaller clusters)
  • Does not have a server name or any server tags
  • Does not show up in certain system tables, including the server_config table

However, a RethinkDB proxy still provides the following features:

  • It listens for client connections.
  • It joins the cluster and connects directly to all servers in it, so it can
    efficiently route read and write operations to the responsible replicas.
  • It parses, interprets and processes queries. While most work will typically be
    passed on to the nodes that host the table data, certain operations will be performed
    on the proxy itself (for example non-indexed orderBy, operations on arrays, etc.).
  • It manages changefeeds locally, allowing deduplication of change notifications within
    the cluster.

To start a RethinkDB proxy, you can run rethinkdb proxy -j <other server>. See TODO for
more details on this command.

When should I use a RethinkDB proxy?

Primary use cases include:

  • Scaling changefeeds.
  • Reducing latencies within the cluster.
  • Improving throughput of queries where the bottleneck lies in certain aspects of the
    query processing, such as query parsing or expensive array-based operations.

We will see in the next section how proxy servers can achieve these objectives.

How can a proxy improve performance?

With a proxy running locally on each application server, you can avoid an additional
network hop thanks to the proxy's routing logic. The proxy always knows which database
server holds the data for a particular request, and can route the request directly to
the responsible server. Compare that to a client connecting directly to a database
server: since the application usually doesn't know which database server has the data
needed for a particular query, the server handling the query will need to forward the
request to obtain the data. Two network hops are required in this scenario, instead
of just one with the local proxy.
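
The hop-count difference can be sketched as a toy model. The two-shard map, server names, and helper names below are made up for illustration; this is not RethinkDB's internal routing code.

```python
# Toy model of request routing with and without a local proxy.
# SHARD_MAP is a hypothetical two-shard table: keys a-m on server1, n-z on server2.

SHARD_MAP = {"a-m": "server1", "n-z": "server2"}

def responsible_server(key):
    """Return the server holding the shard for a given key."""
    return SHARD_MAP["a-m"] if key[0].lower() <= "m" else SHARD_MAP["n-z"]

def hops_without_proxy(key, entry_server):
    """A client connects to some arbitrary server; that server must forward
    the request if it doesn't hold the key itself."""
    hops = 1  # client -> entry server
    if entry_server != responsible_server(key):
        hops += 1  # entry server -> responsible server
    return hops

def hops_with_local_proxy(key):
    """A local proxy knows the shard map and routes directly to the
    responsible server; the client -> proxy hop stays on the local host."""
    return 1  # proxy -> responsible server
```

For a key living on server1, a client that happened to connect to server2 pays two network hops, while the local proxy always pays one.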

Since proxies also perform certain query processing steps themselves, they can help
scale those queries more easily. Adding a proxy to a cluster is often easier than
adding a full database server. The proxy not only handles decoding client requests
and encoding query responses, but also performs a number of calculations locally.
These specifically include many ReQL commands that operate on in-memory arrays,
as well as commands that work on aggregated data (e.g. orderBy without an index,
commands following an ungroup operation, etc.).

Changefeeds

In addition to regular queries, a proxy can also be a very powerful tool for scaling
changefeed-heavy applications.

A proxy will manage changefeeds locally, and reduce the overhead (RAM, CPU and network)
on the database servers. For any write operation to a table with active changefeeds,
a database server only sends a single network message to a given proxy. If for example
10,000 clients are listening through a proxy to changes on a particular table, the
database server(s) hosting the table will send a single network message to the proxy when
the table is changed. The proxy then takes care of forwarding the change message to all
10,000 clients locally.

This even works if the clients are listening to different selections on the table, such
as when using the query r.table('test').getAll(val, {index: "idx"}).changes() with
different values val for each changefeed. The proxy will receive one message from the
database servers for every write to the table 'test', and will check locally which
changefeeds are affected by this change.
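
The fan-out and local filtering described above can be sketched with a toy hub. The class and method names are made up for illustration and are not RethinkDB's actual changefeed machinery:

```python
class ProxyChangefeedHub:
    """Toy model of proxy-side changefeed handling: the database server
    sends one change message to the proxy per write, and the proxy checks
    each locally registered changefeed's predicate before forwarding."""

    def __init__(self):
        self.subscribers = []       # list of (predicate, inbox) pairs
        self.messages_from_db = 0   # network messages received from the DB

    def subscribe(self, predicate):
        """Register a changefeed with a filter predicate; return its inbox."""
        inbox = []
        self.subscribers.append((predicate, inbox))
        return inbox

    def on_change(self, change):
        """One message from the DB server, fanned out locally to matches."""
        self.messages_from_db += 1
        for predicate, inbox in self.subscribers:
            if predicate(change):
                inbox.append(change)

hub = ProxyChangefeedHub()
# Ten clients each watch a different value, like getAll(val).changes():
inboxes = [hub.subscribe(lambda c, v=v: c["idx"] == v) for v in range(10)]
hub.on_change({"idx": 3, "doc": "updated"})
```

After the write, the hub has received a single message from the database server, and only the one changefeed whose predicate matches sees the change.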

How do I run it?

A proxy has lower hardware requirements than a regular database server. In particular,
it doesn't require fast storage, and it uses less RAM because it doesn't need to
maintain a local data cache.

That being said, proxies still benefit from fast CPUs, and require enough RAM to process
your application's queries. 256 MB of available RAM is often enough for simple queries
and moderate query throughput. However, complex queries and/or a high number of concurrent
operations (including a high number of open changefeeds) might require additional
resources on the proxy server.

Note that in contrast to a regular client, which can connect to a single server, a proxy
server must be able to connect directly to all database servers on their intra-cluster
ports (port 29015 by default). If a proxy is unable to connect to all database servers,
some tables might become inaccessible through the proxy, and queries on those tables will fail.
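
As a quick sanity check that a prospective proxy host can reach every server's intra-cluster port, one could run something like the following. This is a generic TCP connectivity check; the helper name and host list are made up for illustration:

```python
import socket

def check_cluster_reachability(servers, timeout=1.0):
    """Return the subset of (host, port) pairs that accept a TCP connection.

    A proxy must be able to reach every database server on its
    intra-cluster port (29015 by default); any unreachable server makes
    some tables inaccessible through the proxy.
    """
    reachable = []
    for host, port in servers:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                reachable.append((host, port))
        except OSError:
            pass  # refused, timed out, or unresolvable
    return reachable
```

Comparing the returned list against the full server list before starting the proxy would surface firewall or routing problems early.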

If all requirements are met, you can run a proxy through the rethinkdb proxy command.
See TODO for details.

TODO: Mention/Describe rethinkdb/rethinkdb#5138 once available

TODO: Maybe describe a few "best practice" scenarios in more detail. E.g.
"proxy running on each app server" vs. "adding a central pool of proxies to a cluster".

@williamstein

This is fantastic!! Things to maybe expand on:

  • Emphasize that you can add/remove proxies at any time without any impact on cluster stability (unlike normal nodes).
  • Say something about the relationship with connection pools; for example, with SageMathCloud I have a (complicated?) connection pool system, where applications make lots of connections to the database, round-robin queries across them, destroy connections that are slow, etc. I used this before using proxy nodes. I switched to proxy nodes and I'm still using this connection pool, but maybe it is completely unnecessary?
  • Maybe say something about how potentially more data gets transferred over the network in some cases. Where it says "For any write operation to a table with active changefeeds,..." you give the optimal best case situation. But isn't there a worst case -- maybe the client is listening for a very specific change, and 99.99% of updates to the table don't trigger that; however, with a proxy node, all of those changes get sent to the proxy node. For my personal use case (everything on a local super fast free network inside Google Compute Engine), even this worst case is fine, since the network is so good. But it could matter for some multi-data center deployments.

@danielmewes
Member

Thanks for the feedback @williamstein . Very useful. We'll try to incorporate that.

I think connection pools are still useful, because the proxy is still going to be able to utilize multiple cores better with multiple client connections.

@hamiltop

What about multiple client connections enables better CPU usage in a proxy?

@danielmewes
Member

@hamiltop Yeah that's what I meant. :-)

@hamiltop

@danielmewes Sorry, I was asking a question. Why does that enable better CPU usage? What aspect of multiple client connections leads to more CPU usage?

@danielmewes
Member

Ah, sorry. Each incoming client connection is assigned to one CPU core randomly (or actually round robin I think). A lot of work for any query run through that connection is going to happen on that core.

So by using multiple connections and spreading queries across them, you can better utilize multiple CPU cores on the proxy.
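
A minimal sketch of such a round-robin pool follows. The connection objects here are placeholders; a real pool would hold driver connections and execute the query on the chosen one:

```python
import itertools

class RoundRobinPool:
    """Spread queries across several client connections so a proxy (which
    pins each incoming connection to one CPU core) can use all its cores."""

    def __init__(self, connections):
        self._cycle = itertools.cycle(connections)

    def run(self, query):
        conn = next(self._cycle)
        # A real pool would execute the query on conn here; we just
        # return the pairing to show the round-robin distribution.
        return conn, query
```

Each successive query lands on the next connection in the cycle, wrapping around after the last one.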

@hamiltop

Interesting. Is that true for normal cluster connections? (not just proxies)

@danielmewes
Member

This is true for normal servers as well.

However, on a normal server there are more tasks that do not depend on the core
handling the connection and that use their own model for multi-thread distribution.
So it's a little bit less relevant there, depending on the workload.


@danielmewes
Member

I think most of the information is here. Handing over to @chipotle .

@brucepom

brucepom commented Sep 9, 2016

It would also be great to see some recommendations on starting the proxy and restarting it if the process dies or the server is rebooted. Looking at rethinkdb/rethinkdb#5138, there's a suggestion that this can be done by editing the init script. Not being an expert on this, I wasn't confident enough to jump into /etc/init.d/rethinkdb and start messing around, so I ended up using Upstart instead of altering the init script. I wrote up some notes, as I couldn't find an explanation anywhere of how to do this. I'd welcome feedback on the approach; I'm certainly not experienced with this.

@danielmewes
Member

Thanks for sharing your notes on this @brucepom ( https://medium.com/@brucepomeroy/running-a-rethinkdb-proxy-on-ubuntu-68f8cd308b7b ).

While we should fix this more generally in the mid-term (rethinkdb/rethinkdb#5138), it might be nice to mention how to add the upstart script for the meantime in our docs. @chipotle do you think that's something we could incorporate?

@suru1432002

I set up a RethinkDB proxy on a separate node rather than running the proxy on the app server itself, so my app contacts the proxy node, which in turn fetches the data from the RethinkDB cluster.

Is there any way to figure out whether the query processing is actually happening on the proxy machine?

From the netstat command I see that my proxy node is connected to some unknown IP on port 28015, apart from the cluster nodes (an IP I didn't use or configure anywhere in the network).

@thomasmodeneis

Hi,
I wrote a little post about Running a RethinkDB Proxy as Daemon that could be helpful to someone ...

@bbar

bbar commented Oct 7, 2018

Here's my attempt to start RethinkDB as a proxy node using systemd in Ubuntu 16.04. Feel free to add to it...

  1. Install RethinkDB as usual, as outlined in the documentation. (Don't worry about copying the sample configuration file mentioned there.)

  2. Create a systemd unit file:

     $ vim /lib/systemd/system/rethinkdb-proxy-node.service

  3. Add the following to the file:

     [Unit]
     Description=RethinkDB proxy node

     [Service]
     User=rethinkdb
     Group=rethinkdb
     ExecStart=/usr/bin/rethinkdb proxy --join <SOME_IP_ADDRESS_IN_YOUR_CLUSTER>:29015 --log-file /var/log/rethinkdb/rethinkdb.log --initial-password auto
     KillMode=process
     PrivateTmp=true

     [Install]
     WantedBy=multi-user.target

  4. Create the log dir and set its permissions:

     $ sudo mkdir /var/log/rethinkdb
     $ sudo chown rethinkdb:rethinkdb /var/log/rethinkdb

  5. Enable run on startup:

     $ sudo systemctl enable rethinkdb-proxy-node.service

  6. Start the service:

     $ sudo systemctl start rethinkdb-proxy-node.service

Other useful things...

Check the status:

 $ sudo systemctl status rethinkdb-proxy-node.service

Tail the log:

 $ tail -f /var/log/rethinkdb/rethinkdb.log

@atris

atris commented Oct 8, 2018

@bbar Could you open a pull request for this?

@bbar

bbar commented Oct 9, 2018

@atris sure. In this file, right?

@atris

atris commented Oct 9, 2018 via email
