feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little confusing.

In Elasticsearch 5+, there are _two_ different string datatypes:
- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much clearer.

In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` makes the following changes, all of which we'd like for these literal fields:
- Analysis is skipped altogether and the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is the most interesting. In short, doc_values take up a little disk space but allow us to perform aggregations very efficiently.

Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dashboard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is fielddata, which is extremely expensive in terms of memory usage.

This PR disables doc_values for all fields except `source` and `layer`, which gives us about a 4% disk space savings. Merely changing the literal fields to use `not_analyzed` _increases_ disk usage by around 3%, so this is roughly a 7% win!

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719), where the `string` datatype we use now is completely removed.

Fixes #99
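
For illustration only, here is a minimal Python sketch of the kind of field definition described above, plus a terms aggregation that could replace the per-type counts. It is not the actual Pelias schema code: the helper `literal_field`, the field list, and the aggregation name `sources` are hypothetical. Only the `type`, `index`, and `doc_values` parameters come from the Elasticsearch 2.4 string datatype documentation.

```python
import json

# Hypothetical field list; only `source` and `layer` are named in this message.
LITERAL_FIELDS = ["source", "layer", "source_id"]

# Fields we expect to aggregate on, so they keep doc_values.
AGGREGATED_FIELDS = {"source", "layer"}


def literal_field(name):
    """Build an ES 2.4 string mapping for a literal (unanalyzed) field.

    `not_analyzed` skips analysis and disables norms by default;
    doc_values are switched off unless the field is aggregated on.
    """
    return {
        "type": "string",
        "index": "not_analyzed",
        "doc_values": name in AGGREGATED_FIELDS,
    }


properties = {name: literal_field(name) for name in LITERAL_FIELDS}

# Request body for counting documents per source with a terms aggregation.
# `size: 0` asks for aggregation results only, with no search hits.
count_by_source = {
    "size": 0,
    "aggs": {"sources": {"terms": {"field": "source"}}},
}

print(json.dumps({"properties": properties}, indent=2))
print(json.dumps(count_by_source, indent=2))
```

The same request body with `"field": "layer"` would give per-layer counts, which is the sort of query the dashboard would need once mapping types are gone.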