feat(mapping): use "index": "not_analyzed" for literal fields
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ distinct string datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much clearer. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` has the
following effects, all of which we'd like for these literal fields:

- Analysis is skipped altogether and the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.
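
As a rough illustration of how the 2.4 setting lines up with the ES5+ datatypes, here is a minimal sketch. The `toEs5Literal` helper is hypothetical and not part of this repository; it only assumes that a `not_analyzed` string corresponds to the `keyword` type described in the linked documentation.

```javascript
// The literal field mapping as written for Elasticsearch 2.4 in this commit.
const literalEs2 = {
  type: 'string',
  index: 'not_analyzed'
};

// Hypothetical helper (an assumption for illustration only): translate a
// 2.4-style literal mapping into the ES5+ `keyword` datatype. Any other
// mapping is returned unchanged.
function toEs5Literal(mapping) {
  if (mapping.type === 'string' && mapping.index === 'not_analyzed') {
    return { type: 'keyword' };
  }
  return mapping;
}

const literalEs5 = toEs5Literal(literalEs2);
```

Analyzed fields, such as the ones using `peliasAdmin`, pass through this helper untouched, since they would map to `text` rather than `keyword`.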

The last one is the most interesting. In short, doc_values take up a
little disk space but allow us to perform aggregations very efficiently.
Pelias doesn't generally perform aggregations today. However, after we
begin using a [single mapping type](#293), we will have no way for the
[pelias dashboard](https://github.com/pelias/dashboard) or any of our own
analysis scripts to provide document counts for different sources or
layers. The dashboard currently uses an API to get the count of various
mapping types, which won't be supported going forward.

While this is a minor issue, we need a solution to it, and the only
alternative is fielddata, which is extremely expensive in terms of memory usage.
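
To make the document-count use case concrete, here is a sketch of a search body that counts documents per value with a `terms` aggregation. The `buildCountsQuery` helper, the `*_counts` aggregation names, the bucket size of 100, and the use of `source` and `layer` as field names are all assumptions for illustration; this is not the dashboard's actual code.

```javascript
// Hypothetical helper: build a search body with one `terms` aggregation per
// field. `size: 0` at the top level asks Elasticsearch to return no hits,
// only the aggregation results.
function buildCountsQuery(fields) {
  const aggs = {};
  fields.forEach(function (field) {
    // The bucket size of 100 is an arbitrary illustrative limit.
    aggs[field + '_counts'] = { terms: { field: field, size: 100 } };
  });
  return { size: 0, aggs: aggs };
}

// Assumed field names for the counts mentioned above.
const body = buildCountsQuery(['source', 'layer']);
```

Sent as the body of a search request, this should come back with one list of buckets per aggregation under `aggregations`, each bucket carrying a `key` and a `doc_count`.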

In my testing, for the Portland metro Docker project, disk usage went
from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling
`doc_values` for the `parent.*_id` fields. We don't have an immediate
need for `doc_values` on those fields, although it might be interesting
for analysis.
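
If we did go that route, one way to do it might look like the sketch below. The `disableDocValuesForIds` helper and the sample `parent` object are hypothetical; the only thing assumed about the real schema is that the literal parent fields end in `_id`, as the tests in this commit suggest.

```javascript
// Hypothetical helper: return a copy of the parent properties in which
// every field whose name ends in `_id` has doc_values turned off.
// Other fields are passed through unchanged; the input is not mutated.
function disableDocValuesForIds(parentProperties) {
  const result = {};
  Object.keys(parentProperties).forEach(function (name) {
    const mapping = parentProperties[name];
    result[name] = /_id$/.test(name)
      ? Object.assign({}, mapping, { doc_values: false })
      : mapping;
  });
  return result;
}

// Illustrative sample input, not the full Pelias schema.
const parent = {
  country_a: { type: 'string', analyzer: 'peliasAdmin' },
  country_id: { type: 'string', index: 'not_analyzed' }
};
const trimmed = disableDocValuesForIds(parent);
```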

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
orangejulius committed Oct 25, 2018
1 parent 8f5cf10 commit eba4340
Showing 4 changed files with 260 additions and 260 deletions.
2 changes: 1 addition & 1 deletion mappings/partial/literal.json
@@ -1,4 +1,4 @@
{
"type": "string",
-"analyzer": "keyword"
+"index": "not_analyzed"
}
4 changes: 2 additions & 2 deletions test/document.js
@@ -117,7 +117,7 @@ module.exports.tests.parent_analysis = function(test, common) {
t.equal(prop[field+'_a'].type, 'string');
t.equal(prop[field+'_a'].analyzer, 'peliasAdmin');
t.equal(prop[field+'_id'].type, 'string');
-t.equal(prop[field+'_id'].analyzer, 'keyword');
+t.equal(prop[field+'_id'].index, 'not_analyzed');

t.end();
});
@@ -129,7 +129,7 @@ module.exports.tests.parent_analysis = function(test, common) {
t.equal(prop['postalcode'+'_a'].type, 'string');
t.equal(prop['postalcode'+'_a'].analyzer, 'peliasZip');
t.equal(prop['postalcode'+'_id'].type, 'string');
-t.equal(prop['postalcode'+'_id'].analyzer, 'keyword');
+t.equal(prop['postalcode'+'_id'].index, 'not_analyzed');

t.end();
});