This repository has been archived by the owner on Oct 17, 2023. It is now read-only.

Adapter: Implements RethinkDB as a source of documents #64

Merged
14 commits merged into compose:master on Mar 17, 2015

Conversation

alindeman
Contributor

Similar to the MongoDB adapter, this proposed RethinkDB source sends all documents in the table, then (if configured via the tail configuration parameter) watches for changes via RethinkDB's Changefeeds feature.
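For anyone unfamiliar with Changefeeds, here is a rough, standalone sketch of the copy-then-tail flow using gorethink. The table name, connection details, and logging are placeholders for illustration only; the real adapter sends each document down the transporter pipe instead of logging it.

```go
package main

import (
	"log"

	gorethink "github.com/dancannon/gorethink" // import path varies across gorethink versions
)

func main() {
	// Connection details are placeholders for illustration.
	session, err := gorethink.Connect(gorethink.ConnectOpts{
		Address:  "127.0.0.1:28015",
		Database: "test",
	})
	if err != nil {
		log.Fatal(err)
	}

	// 1. Send every existing document in the table.
	cursor, err := gorethink.Table("widgets").Run(session)
	if err != nil {
		log.Fatal(err)
	}
	var doc map[string]interface{}
	for cursor.Next(&doc) {
		log.Printf("initial document: %v", doc) // the adapter would send this down the pipe
	}

	// 2. With tail enabled, watch the changefeed for subsequent changes.
	changes, err := gorethink.Table("widgets").Changes().Run(session)
	if err != nil {
		log.Fatal(err)
	}
	var change struct {
		NewVal map[string]interface{} `gorethink:"new_val"`
		OldVal map[string]interface{} `gorethink:"old_val"`
	}
	for changes.Next(&change) {
		if change.NewVal == nil {
			log.Printf("deleted: %v", change.OldVal) // maps to a Delete op
		} else {
			log.Printf("inserted/updated: %v", change.NewVal)
		}
	}
}
```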

Also implemented is a small change to the ElasticSearch adapter to support Delete operations.

👀 in the form of code review definitely welcomed.

/cc: @brandon-beacher

@nstott
Contributor

nstott commented Mar 16, 2015

Hey, thanks for this! Getting rethink working as a source will be great.

One of the things with mongo that's different than rethink is that with mongo we have the timestamp associated with the operation, so we're able to resume streaming ops from a specific point in time if transporter crashes. From what I understand of rethink replication, we aren't able to get that information from the changefeed?

The session handling for this is mostly unfinished right now, so I don't see that as a blocker to getting this in.

@alindeman
Contributor Author

One of the things with mongo that's different than rethink is that with mongo we have the timestamp associated with the operation, so we're able to resume streaming ops from a specific point in time if transporter crashes. From what I understand of rethink replication, we aren't able to get that information from the changefeed?

You are right, there is no support yet for restarting a changes feed in RethinkDB. There's at least one proposal (rethinkdb/rethinkdb#3471), but it is not yet implemented.

That said, as far as I can tell, this is a potential problem with the MongoDB adapter too. Imagine this scenario:

  • At 00:00, the transporter crashes or loses its connection to MongoDB (for whatever reason: a crash, MongoDB downtime, network issues, hardware issues, etc.)
  • At 00:10, the transporter starts again. The oplogTime is not persisted between runs; it is instead captured as nowAsMongoTimestamp() at startup (see the sketch below), so, as far as I can tell, those ~10 minutes of oplog will never be read.
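As an aside, here is a minimal sketch of the pattern described in the second bullet. This is not a verbatim excerpt from the MongoDB adapter; the helper name comes from the comment above, and the body is just the standard MongoDB timestamp packing applied to the wall clock at startup.

```go
package oplog // illustrative package name

import (
	"time"

	"gopkg.in/mgo.v2/bson"
)

// nowAsMongoTimestamp builds a resume point from the wall clock at startup,
// packing the seconds value into the high 32 bits as MongoDB timestamps do.
// Because nothing is persisted between runs, oplog entries written while the
// process was down fall before this timestamp and are never replayed.
func nowAsMongoTimestamp() bson.MongoTimestamp {
	return bson.MongoTimestamp(time.Now().Unix() << 32)
}
```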

I believe it's mitigated by the fact that the MongoDB source (and this proposed RethinkDB source too) sends all documents before tailing the changes feed. Otherwise, we'd need to keep state around between runs of the transporter and implement some kind of snapshotting mechanism. It might be a good idea to consider this at some point, if only because sending all documents is expensive for large tables, but it's also something that would require some architectural rethinking and would make transporter a stateful service, where it's now stateless.

The most obvious place that it falls down is deletions. If the transporter restarts and a document is deleted during the downtime, we will not be notified of it and will therefore not send a Delete operation through the transporter pipe. In the case of MongoDB, it's because we don't rewind the oplog to the time when it crashed, and in the case of RethinkDB, it's because this kind of feature is not supported.

I decided not to tackle the "deletions during downtime" problem in this PR because I think it will require more fundamental architectural changes, and it is already not well supported in MongoDB. I recommend that we 🏈 it to a new PR, though some documentation around that failure scenario might be appropriate in the meantime.

What do you think? :)

@jipperinbham
Contributor

That said, as far as I can tell, this is a potential problem with the MongoDB adapter too.

That is true at the moment. Some work has begun on the adaptor-state branch toward optionally adding state persistence to transporter, but other work and changes to message.Msg caused us to hold off on it for now.

It might be a good idea to consider this at some point, if only because sending all documents is expensive for large tables, but it's also something that would require some architectural rethinking and would make transporter a stateful service, where it's now stateless.

The current proposal (not really documented beyond what has been implemented in the branch) is that adaptors would mostly not know or care about State, except during startup, when the last known good state would be injected into them. The gathering of State happens within the Pipeline, and persistence is controlled by the implemented SessionStore. So, ideally, adaptors would not change much at all other than performing some Resume process on startup.
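To make that concrete, here is a minimal sketch of the shape being described. The package, type, and method names are hypothetical (the adaptor-state branch may well differ); the point is only that the Pipeline gathers State, a SessionStore persists it, and an adaptor sees the last known good State at startup.

```go
package state // hypothetical package and type names, sketching the idea only

import "time"

// State is a last-known-good position for a source adaptor, e.g. a MongoDB
// oplog timestamp; RethinkDB has no equivalent resume token until
// rethinkdb/rethinkdb#3471 lands.
type State struct {
	Identifier string    // opaque, adaptor-specific resume token
	Timestamp  time.Time // when the state was captured by the Pipeline
}

// SessionStore persists State between transporter runs; an implementation
// might write to local disk, a key/value store, etc.
type SessionStore interface {
	Set(path string, state State) error
	Get(path string) (State, error)
}

// Resumer is what an adaptor could implement so the last known good State
// can be injected into it at startup; adaptors that cannot resume simply
// start from scratch as they do today.
type Resumer interface {
	Resume(last State) error
}
```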

All that to say: this PR should not be held up by that implementation now. Even after we have the concept of State, the RethinkDB adaptor will be unable to resume from a point in time until rethinkdb/rethinkdb#3471 is available.

I have one, maybe two, suggestions/changes, but I'll make them inline relative to the code.

@@ -31,10 +33,24 @@ type Rethinkdb struct {
	client *gorethink.Session
}

// rethinkDbConfig provides custom configuration options for the RethinkDB adapter
type rethinkDbConfig struct {
@jipperinbham
Contributor

This needs to be publicly accessible so that it can be properly "Registered" here as:

Register("rethinkdb", "a rethinkdb sink adaptor", NewRethinkdb, RethinkdbConfig{})
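For illustration, the exported version might look something like the following. The field names and doc tags here are assumptions modeled on the other adaptors' config structs, not the PR's exact code.

```go
package rethinkdb

// RethinkdbConfig provides custom configuration options for the RethinkDB adapter.
// Field names and doc tags are illustrative assumptions.
type RethinkdbConfig struct {
	URI       string `json:"uri" doc:"the uri to connect to, e.g. rethink://127.0.0.1:28015/database"`
	Namespace string `json:"namespace" doc:"rethink namespace to read/write, e.g. database.table"`
	Debug     bool   `json:"debug" doc:"if true, verbose debugging output is enabled"`
	Tail      bool   `json:"tail" doc:"if true, the changefeed is tailed after the initial copy"`
}
```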

@alindeman
Contributor Author

@jipperinbham Thanks for the 👀! I've pushed some changes that I believe address your comments.

@nstott
Contributor

nstott commented Mar 17, 2015

👍

nstott added a commit that referenced this pull request Mar 17, 2015
Adapter: Implements RethinkDB as a source of documents
nstott merged commit 253c2cc into compose:master on Mar 17, 2015