feat(mapping): use "index": "not_analyzed" for literal fields

As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it is a little confusing.

In Elasticsearch 5+, there are _two_ different string datatypes:
- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html)
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html)

These documentation pages make the difference much clearer.

In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` makes the following changes, all of which we'd like for these literal fields:
- Analysis is skipped altogether and the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled

The last one is the most interesting. doc_values take up a little disk space but allow us to perform aggregations very efficiently.

Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dashboard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata, which is extremely expensive in terms of memory usage.

In my testing, for the Portland metro Docker project, disk usage went from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling `doc_values` for the `parent.*_id` fields. We don't have an immediate need for `doc_values` on those fields, although it might be interesting for analysis.

Summary
------
While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719), where the `string` datatype we use now is completely removed.

Fixes #99
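
The per-source and per-layer document counts discussed above could be fetched with a terms aggregation once doc_values are enabled. The following is only a minimal sketch, not code from this PR: the helper names `buildCountQuery` and `bucketsToCounts`, the aggregation name `counts`, and the example field names (`source`, `layer`) are assumptions made for illustration.

```javascript
// Build a search body that counts documents per distinct value of a
// not_analyzed field. `field` is assumed to be a literal field such as
// "source" or "layer"; `maxBuckets` caps how many distinct values come back.
function buildCountQuery(field, maxBuckets) {
  return {
    size: 0, // no hits needed, only the aggregation buckets
    aggs: {
      counts: {
        terms: { field: field, size: maxBuckets || 50 }
      }
    }
  };
}

// Turn the aggregation section of a search response into a plain
// { value: documentCount } object.
function bucketsToCounts(response) {
  const counts = {};
  response.aggregations.counts.buckets.forEach(function (bucket) {
    counts[bucket.key] = bucket.doc_count;
  });
  return counts;
}
```

The body returned by `buildCountQuery('layer')` would be sent as a normal search request; because the field is not analyzed, each bucket key is the raw stored value rather than an analyzed token.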
pelias · Oct 25, 2018 · eba4340
1 parent 8f5cf10
commit eba4340
Showing 4 changed files with 260 additions and 260 deletions.
diff --git a/mappings/partial/literal.json b/mappings/partial/literal.json
@@ -1,4 +1,4 @@
 {
   "type": "string",
-  "analyzer": "keyword"
+  "index": "not_analyzed"
 }
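
The commit message mentions two possible follow-ups to this partial: disabling `doc_values` on the `parent.*_id` fields, and the eventual move to the `keyword` datatype in Elasticsearch 5+. A rough sketch of both, with the caveat that `withoutDocValues` and `toKeyword` are hypothetical helpers that are not part of this change:

```javascript
// The literal partial as changed in this commit (Elasticsearch 2.4).
const literal = { type: 'string', index: 'not_analyzed' };

// Hypothetical variant for fields such as parent.*_id: the same mapping,
// with doc_values switched off to save disk space. Returns a copy and
// leaves the input untouched.
function withoutDocValues(mapping) {
  return Object.assign({}, mapping, { doc_values: false });
}

// Hypothetical helper: the rough Elasticsearch 5+ equivalent of a
// not_analyzed string field is the `keyword` datatype.
function toKeyword(mapping) {
  if (mapping.type !== 'string' || mapping.index !== 'not_analyzed') {
    return mapping; // analyzed fields are left as they are
  }
  const out = { type: 'keyword' };
  if (mapping.doc_values === false) {
    out.doc_values = false; // carry over an explicit opt-out
  }
  return out;
}
```

Keeping the opt-out as a separate copy rather than editing the shared partial means the other literal fields would still get `doc_values` by default.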
diff --git a/test/document.js b/test/document.js
@@ -117,7 +117,7 @@ module.exports.tests.parent_analysis = function(test, common) {
       t.equal(prop[field+'_a'].type, 'string');
       t.equal(prop[field+'_a'].analyzer, 'peliasAdmin');
       t.equal(prop[field+'_id'].type, 'string');
-      t.equal(prop[field+'_id'].analyzer, 'keyword');
+      t.equal(prop[field+'_id'].index, 'not_analyzed');
 
       t.end();
     });
@@ -129,7 +129,7 @@ module.exports.tests.parent_analysis = function(test, common) {
     t.equal(prop['postalcode'+'_a'].type, 'string');
     t.equal(prop['postalcode'+'_a'].analyzer, 'peliasZip');
     t.equal(prop['postalcode'+'_id'].type, 'string');
-    t.equal(prop['postalcode'+'_id'].analyzer, 'keyword');
+    t.equal(prop['postalcode'+'_id'].index, 'not_analyzed');
 
     t.end();
   });