Explore using not_analyzed for source #99

Closed
orangejulius opened this issue Jan 28, 2016 · 1 comment

orangejulius commented Jan 28, 2016

The source field currently uses the "keyword" analyzer, which keeps the full string as a single, unmodified token. According to the keyword analyzer docs, it might make more sense to use the "not_analyzed" setting, which the docs touch on only briefly. It looks like it would produce the same result while possibly being faster.
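
For reference, the two options being compared could look roughly like this in an Elasticsearch 2.x mapping. This is only a sketch: the dictionaries are illustrative and are not copied from the Pelias schema.

```python
import json

# Option 1 (current): run the value through the built-in "keyword"
# analyzer, which emits the whole input as one token.
source_with_keyword_analyzer = {"type": "string", "analyzer": "keyword"}

# Option 2 (proposed): skip analysis and index the raw value as-is.
source_not_analyzed = {"type": "string", "index": "not_analyzed"}

for label, field in [
    ("keyword analyzer", source_with_keyword_analyzer),
    ("not_analyzed", source_not_analyzed),
]:
    print(label, json.dumps({"source": field}))
```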

@orangejulius

Update: we've learned a bit about Elasticsearch since writing this issue. It's unlikely that choosing between the keyword analyzer and the not_analyzed setting will make a difference. However, we should go through and ensure we aren't indexing any fields that we don't need to search on.

orangejulius added a commit that referenced this issue Oct 23, 2018
Background
==========

I always thought that it was important to use the `store` parameter to
specify whether a field should be stored, in addition to indexing, and
that the default was to not store a field for later retrieval.

It turns out this isn't true, and that all fields are [copied to the _source](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/mapping-store.html)
field by default.

Setting `"store": "yes"` is only needed if, in addition to getting a
field back as part of the `_source` (which contains _every_ field
in the document), we wanted to be able to return a single field on its
own. Pelias doesn't currently do this; we always ask Elasticsearch for
the entire `_source` field.

In addition, Elasticsearch has a [source filtering](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-request-source-filtering.html)
feature, so if we ever wanted to return only some of `_source` (which
might someday be the case with something like
pelias/api#1121), the only reason we would
want to bother with `"store": "yes"` is if the size of the `_source`
field was so prohibitive we didn't even want Elasticsearch to fetch all
of it from disk. That might be a concern some day, but not today.
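
As a rough illustration of source filtering, a search request can list the `_source` fields it wants back. The helper name `build_search_body` and the `name.*` pattern are assumptions made for this sketch, not part of the Pelias API code.

```python
import json


def build_search_body(query, source_fields):
    """Return a search body that asks for only some `_source` fields."""
    return {
        "query": query,
        # `_source` accepts a list of field names or wildcard patterns
        "_source": source_fields,
    }


body = build_search_body({"match_all": {}}, ["name.*", "source", "layer"])
print(json.dumps(body, indent=2))
```

Elasticsearch still loads the whole `_source` from disk and trims it before responding, which is why `store` would only matter for a very large `_source`.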

Changes
==========

This PR removes all `"store": "yes"` parameters for all of our fields.

Effectively, we were storing a lot of fields on disk twice, which was
wasting space.
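
The change itself amounts to dropping one key from each field definition. A minimal sketch of the idea, assuming a hypothetical `strip_store` helper and a made-up mapping (the real schema is written in JavaScript and was edited by hand):

```python
def strip_store(field):
    """Return a copy of a field definition without its `store` setting."""
    cleaned = {key: value for key, value in field.items() if key != "store"}
    # Recurse into object properties and multi-fields. Keys inside these
    # containers are field names, so a field called "store" is kept.
    for container in ("properties", "fields"):
        if container in cleaned:
            cleaned[container] = {
                name: strip_store(sub) for name, sub in cleaned[container].items()
            }
    return cleaned


example = {
    "properties": {
        "source": {"type": "string", "analyzer": "keyword", "store": "yes"},
        "name": {
            "type": "object",
            "properties": {"default": {"type": "string", "store": "yes"}},
        },
    }
}
cleaned = strip_store(example)
print(cleaned)
```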

In my testing of the Portland, Oregon Docker project, which has about
1.8 million documents, this change reduces the disk space usage from
551MB to 492MB, or about 10%!
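
For anyone checking the math, the figures above work out as follows (only the two sizes quoted above are used):

```python
before_mb = 551
after_mb = 492

saved_mb = before_mb - after_mb
saved_pct = 100 * saved_mb / before_mb

print(f"{saved_mb} MB saved ({saved_pct:.1f}%)")  # 59 MB saved (10.7%)
```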

_Sidenote:_ If there are other fields we _do_ want to keep out of the
`_source` field,
[`_source.exclude`](https://github.com/pelias/schema/blob/master/mappings/document.js#L158-L159) in our document mapping is how we can do it.
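
A sketch of what such an exclusion looks like in a mapping. In the Elasticsearch mapping docs the key is spelled `excludes`; the field names below are placeholders, not the ones in the linked file.

```python
import json

document_mapping = {
    "_source": {
        # Fields matching these patterns are still indexed and searchable,
        # but are left out of the stored `_source`. Names are placeholders.
        "excludes": ["large_field", "debug.*"],
    },
    "properties": {
        "source": {"type": "string", "analyzer": "keyword"},
    },
}

print(json.dumps(document_mapping, indent=2))
```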

After this change, I'm now pretty confident we are doing the right thing
for all our fields when it comes to storing and analyzers, so this
closes #99
orangejulius added a commit that referenced this issue Oct 25, 2018
As guessed in #99, there _are_
differences between setting `"index": "not_analyzed"` for a field, and
merely setting the analyzer to `keyword`.

They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params)
documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different types of string
datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much clearer. In short,
in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the
following changes, all of which we'd like for these literal fields:

- Analysis is skipped altogether; the raw value is added to the index
directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.
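
Spelled out, the three effects listed above correspond roughly to the following field definitions. This is a sketch: the variable names are made up, and the explicit `norms` and `doc_values` entries only restate what `not_analyzed` already implies in 2.4.

```python
import json

# Elasticsearch 2.4: what we get by writing only `"index": "not_analyzed"`
literal_es2 = {"type": "string", "index": "not_analyzed"}

# The same field with the implied settings written out explicitly
literal_es2_explicit = {
    "type": "string",
    "index": "not_analyzed",   # no analysis, raw value is indexed
    "norms": {"enabled": False},  # 2.x syntax for disabling norms
    "doc_values": True,        # on-disk columnar data for aggregations
}

# Elasticsearch 5+: the dedicated datatype that replaces it
literal_es5 = {"type": "keyword"}

print(json.dumps(literal_es2_explicit, indent=2))
```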

The last one is most interesting. In short, doc_values take up a little
disk space but allow us to very efficiently perform aggregations. Pelias
doesn't generally perform aggregations today. However, after we begin
using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dashboard currently uses an API to get the count of various mapping types, which won't be supported going forward.

While minor, we needed a solution to this, and the only other one is
fielddata, which is extremely expensive in terms of memory usage.
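
With doc_values available, per-source or per-layer counts can come from a `terms` aggregation instead. A minimal sketch of such a request body; the helper name and the bucket limit are assumptions, and this is not the dashboard's actual query:

```python
import json


def count_by_field(field, max_buckets=50):
    """Build a search body that counts documents per value of `field`."""
    return {
        "size": 0,  # return no hits, only the aggregation results
        "aggs": {
            f"{field}_counts": {
                "terms": {"field": field, "size": max_buckets},
            }
        },
    }


print(json.dumps(count_by_field("source"), indent=2))
print(json.dumps(count_by_field("layer"), indent=2))
```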

In my testing, for the Portland metro Docker project, disk usage went
from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling
`doc_values` for the `parent.*_id` fields. We don't have an immediate
need for `doc_values` on those fields, although it might be interesting
for analysis.
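
A sketch of what that could look like, assuming a hypothetical `literal()` helper for building the field definition and illustrative `parent` field names:

```python
import json


def literal(doc_values=True):
    """Build a not_analyzed string field, optionally without doc_values."""
    field = {"type": "string", "index": "not_analyzed"}
    if not doc_values:
        # Opt out of the doc_values that not_analyzed enables by default
        field["doc_values"] = False
    return field


mapping = {
    "properties": {
        "source": literal(),  # keep doc_values for counting by source
        "layer": literal(),   # keep doc_values for counting by layer
        "parent": {
            "type": "object",
            "properties": {
                "locality_id": literal(doc_values=False),
                "region_id": literal(doc_values=False),
            },
        },
    }
}

print(json.dumps(mapping, indent=2))
```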

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5.

It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string`
datatype we use now is completely removed.

Fixes #99
JWileczek pushed a commit to JWileczek/schema that referenced this issue Oct 26, 2018
orangejulius added a commit that referenced this issue Nov 2, 2018
This PR disables doc_values for all fields except `source` and `layer`,
which gives us about a 4% disk space savings. Merely changing the literal
field to use `not_analyzed` _increases_ disk usage by around 3%, so
this is roughly a 7% win!

orangejulius added a commit that referenced this issue Nov 3, 2018