Combine many small Elasticsearch updates into consolidated batch updates.
Elastic Mill behaves as a faux Elasticsearch endpoint for your updates. It proxies all of your read requests, while intercepting, queuing and combining your updates into batches.
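To make the idea concrete, here is a minimal sketch in Go of how such a faux endpoint can route traffic. It illustrates the concept rather than Elastic Mill's actual source; the port, queue size, and immediate-acknowledgment behavior are all assumptions.

```go
// A sketch of the faux-endpoint idea: reads are reverse-proxied straight
// to Elasticsearch, while document writes are captured onto a queue to be
// batched later. Not the actual implementation.
package main

import (
	"io"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

// update holds one intercepted document write.
type update struct {
	path string // e.g. /myindex/mytype/1
	body []byte // the document source
}

func main() {
	es, err := url.Parse(os.Getenv("BONSAI_URL")) // the upstream Elasticsearch
	if err != nil {
		panic(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(es)

	// In-memory queue of pending writes; a consumer goroutine (omitted
	// here) would drain this into consolidated bulk requests.
	queue := make(chan update, 1024)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case "PUT", "POST", "DELETE":
			body, _ := io.ReadAll(r.Body)
			queue <- update{path: r.URL.Path, body: body} // intercept the update
			w.WriteHeader(http.StatusAccepted)            // acknowledge right away
		default:
			proxy.ServeHTTP(w, r) // pass reads through untouched
		}
	})
	http.ListenAndServe(":8080", nil)
}
```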
Experimental work in progress: if you're feeling adventurous, give it a try; it seems to work. Let me know how it goes. Otherwise, give me a few more days to test and polish.
Elastic Mill is designed for deployment on Heroku, using the Bonsai Elasticsearch addon, my tools of choice. But it should run pretty much anywhere.
```sh
git clone http://github.com/nz/elasticmill.git # or use your own fork
cd elasticmill
git checkout v1-wip
heroku create --buildpack git://github.com/kr/heroku-buildpack-go.git
heroku addons:add bonsai:starter # or heroku config:add BONSAI_URL=...
git push heroku HEAD:master
heroku run 'curl -X POST $BONSAI_URL/test' # create a test index
```
You can now use the web URL from `heroku apps:info` as the Elasticsearch endpoint for your application.
Configure your application to direct all of its updates to this server. Individual document updates will be consolidated with other pending updates into batches and sent to Elasticsearch's Bulk API.
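For example, your application can issue a plain single-document index request against the Elastic Mill URL, exactly as it would against Elasticsearch itself. A minimal sketch using Go's standard library, with a made-up app hostname, index, and document:

```go
// Hypothetical client-side usage: the app URL from `heroku apps:info`
// stands in for your Elasticsearch endpoint. Hostname, index, and
// document here are made up.
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	endpoint := "https://your-elasticmill-app.herokuapp.com"

	// Index a single document, exactly as you would against Elasticsearch.
	doc := strings.NewReader(`{"title": "Hello, bulk world"}`)
	resp, err := http.Post(endpoint+"/test/doc/1", "application/json", doc)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The proxy acknowledges immediately; batching happens behind the scenes.
	fmt.Println(resp.Status)
}
```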
In general, it is more efficient to update Elasticsearch (and Solr, and Lucene) in fewer consolidated batches than with many small, separate updates. Batching reduces the overhead of TCP connection handling, HTTP request parsing, and JVM garbage collection.
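For reference, a consolidated request to the Bulk API is newline-delimited JSON: an action line naming the index, type, and id, followed by the document source, with a trailing newline at the end. Here is a sketch of two updates combined into one request; the index and document values are illustrative.

```go
// Sketch: what a consolidated Bulk API request looks like on the wire.
// Each update becomes an action line plus a source line, newline-delimited.
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	// Two single-document updates consolidated into one bulk body.
	body := strings.Join([]string{
		`{"index":{"_index":"test","_type":"doc","_id":"1"}}`,
		`{"title":"first document"}`,
		`{"index":{"_index":"test","_type":"doc","_id":"2"}}`,
		`{"title":"second document"}`,
	}, "\n") + "\n" // the Bulk API requires a trailing newline

	resp, err := http.Post(os.Getenv("BONSAI_URL")+"/_bulk",
		"application/json", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```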
However, it can be more efficient and expedient within your application to prepare updates in parallel. Parallelizing is useful to bypass various bottlenecks when loading and serializing your records. It can also be a more natural design pattern to process individual updates to your records as they happen.
The best way to balance these approaches is to use a queue. That way your application gets to leverage the performance benefits of distributing and parallelizing the preparation work, while still consolidating the updates themselves into efficient batch requests for Elasticsearch.
Elastic Mill is designed to abstract the implementation of a queue for you. It intercepts many small update requests, then collects and combines them into batches to be sent to Elasticsearch using its Bulk API.
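The batching itself boils down to a loop that drains a queue, flushing whenever a batch fills up or a timeout elapses. Here is a simplified sketch of that loop using Go channels, in the spirit of the in-memory queue described below; the size and time thresholds are made up.

```go
// A simplified sketch of the batching loop, in the spirit of the
// Go-channel queue the first version uses; thresholds are made up.
package main

import (
	"fmt"
	"time"
)

// flush would assemble the batch into a newline-delimited body and POST
// it to Elasticsearch's _bulk endpoint; here it just reports the size.
func flush(batch [][]byte) {
	fmt.Printf("flushing %d updates in one bulk request\n", len(batch))
}

// batcher drains queued updates, flushing whenever the batch reaches
// maxDocs or maxWait elapses, whichever comes first.
func batcher(queue <-chan []byte, maxDocs int, maxWait time.Duration) {
	var batch [][]byte
	timer := time.NewTimer(maxWait)
	for {
		select {
		case doc := <-queue:
			batch = append(batch, doc)
			if len(batch) >= maxDocs {
				flush(batch)
				batch = nil
				timer.Reset(maxWait)
			}
		case <-timer.C:
			if len(batch) > 0 {
				flush(batch)
				batch = nil
			}
			timer.Reset(maxWait)
		}
	}
}

func main() {
	queue := make(chan []byte, 1024)
	go batcher(queue, 500, 2*time.Second) // hypothetical size and time limits
	queue <- []byte(`{"title":"queued document"}`)
	time.Sleep(3 * time.Second) // give the timer a chance to flush
}
```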
If you're already queuing and asynchronously batching your updates, probably not. If you have a very low volume of updates, probably not. If your Elasticsearch client natively supports this kind of batching, please tell me about it, because I don't know of any that do.
If you have a moderate volume of sustained updates, with peaks measured in updates per second, or if you would benefit from parallelizing your bulk reindexing, this is for you. Direct all of your updates to this server, and it will handle all the batching for you.
There are a handful of useful enhancements to be made here.
- Tests. Erm, yeah, this thing needs tests. If you find a bug, I'd love a test that reproduces it.
- Logging. This proxy is perfectly positioned to log all sorts of useful metrics.
- External queuing. The first version uses Go channels for simplicity, effectively queuing in memory. That presents some mild complications when scaling beyond a single process, so it would be nice to use a more robust external queue, such as IronMQ, built by my friends at Iron.io.
- Solr support. All of these ideas work equally well for Apache Solr.
- Caching and rate limiting. Could be useful!
To contribute:
- Fork the repo.
- Add a test for your change.
- Optionally, implement a fix that makes your test pass.
- Push your commits to a named remote branch.
- Open a pull request, and optionally email Nick with a link.
Elastic Mill was built by Nick Zadrozny, one of the co-founders of One More Cloud, and a co-creator of the Bonsai Hosted Elasticsearch service and Websolr. Nick does search all day every day.
It was created while helping a customer optimize their application's reindexing performance to strike the best balance between parallelizing and batching. The design was born out of years of frustration at the difficulty of integrating queuing into existing clients, and a minor epiphany that clients need not care about such implementation details.