Explore using not_analyzed for source #99

orangejulius · 2016-01-28T19:07:25Z

The source field currently uses the "keyword" analyzer, which basically keeps the full string as a single token with no changes. According to the keyword analyzer docs, it might make more sense to use the "not_analyzed" setting, which is only briefly touched on in the docs. It seems like it might do the same thing while somehow being faster.
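To make the comparison concrete, here is a minimal sketch of the two candidate Elasticsearch 2.x string mappings for `source`, written as plain JavaScript objects. The variable names are illustrative assumptions and are not taken from the Pelias schema.

```javascript
// Sketch only: two ways to keep `source` as a single unmodified token
// in an Elasticsearch 2.x string mapping. Names here are illustrative.

// Current approach: the field is analyzed, but the built-in "keyword"
// analyzer emits the whole input as one token.
const sourceWithKeywordAnalyzer = {
  type: 'string',
  analyzer: 'keyword'
};

// Proposed approach: skip analysis entirely and index the raw value.
const sourceNotAnalyzed = {
  type: 'string',
  index: 'not_analyzed'
};

console.log(JSON.stringify({ sourceWithKeywordAnalyzer, sourceNotAnalyzed }, null, 2));
```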

orangejulius · 2018-08-27T01:05:53Z

Update: we've learned a bit about Elasticsearch since writing this issue. It's unlikely that choosing between the keyword analyzer and the not_analyzed setting will make a difference. However, we should go through and ensure we aren't indexing any fields that we don't need to search on.

Background
==========

I always thought that it was important to use the `store` parameter to specify whether a field should be stored in addition to being indexed, and that the default was to not store a field for later retrieval.

It turns out this isn't true: all fields are [copied to the _source](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/mapping-store.html) field by default. Setting `"store": "yes"` is only needed if, in addition to getting a field back as part of the `_source` (which contains _every_ field in the document), we wanted to be able to return a single field. Pelias doesn't currently do this; we always ask Elasticsearch for the entire `_source` field.

In addition, Elasticsearch has a [source filtering](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/search-request-source-filtering.html) feature, so if we ever wanted to return only some of `_source` (which might someday be the case with something like pelias/api#1121), the only reason we would want to bother with `"store": "yes"` is if the `_source` field was so prohibitively large that we didn't even want Elasticsearch to fetch all of it from disk. That might be a concern some day, but not today.

Changes
==========

This PR removes all `"store": "yes"` parameters for all of our fields. Effectively, we were storing a lot of fields on disk twice, which was wasting space.

In my testing of the Portland, Oregon Docker project, which has about 1.8 million documents, this change reduces the disk space usage from 551MB to 492MB, or about 10%!

_Sidenote:_ If there are other fields we _do_ want to keep out of the `_source` field, [`_source.exclude`](https://github.com/pelias/schema/blob/master/mappings/document.js#L158-L159) in our document mapping is how we can do it.

After this change, I'm now pretty confident we are doing the right thing for all our fields when it comes to storing and analyzers, so this closes #99
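As a rough illustration of the change described above, the sketch below strips every `store` parameter from a mapping object and shows a source-filtered search body as the query-time alternative. `stripStore`, the sample mapping, and the field names in `searchBody` are hypothetical and not taken from the Pelias schema. The helper also assumes that no document field is itself named `store`.

```javascript
// Sketch only: recursively drop `store` from a mapping object.
// `stripStore` and the sample mapping are hypothetical, not Pelias code.
// Assumption: no document field is literally named "store".
function stripStore(node) {
  if (Array.isArray(node)) {
    return node.map(stripStore);
  }
  if (node === null || typeof node !== 'object') {
    return node; // strings, numbers, booleans pass through unchanged
  }
  const result = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === 'store') continue; // rely on _source instead
    result[key] = stripStore(value);
  }
  return result;
}

const before = {
  properties: {
    source: { type: 'string', analyzer: 'keyword', store: 'yes' },
    parent: {
      type: 'object',
      properties: {
        country_id: { type: 'string', analyzer: 'keyword', store: 'yes' }
      }
    }
  }
};
const after = stripStore(before); // `before` is left untouched

// Source filtering: ask for part of _source at query time instead of
// marking individual fields as stored. Field names are illustrative.
const searchBody = {
  _source: ['name.*', 'layer'],
  query: { match_all: {} }
};

// Relative disk savings from the figures quoted above (551MB to 492MB).
const savedPercent = ((551 - 492) / 551) * 100;

console.log(JSON.stringify(after, null, 2));
console.log(savedPercent.toFixed(1) + '% smaller');
```

Returning a new object rather than mutating the input keeps the original mapping available for comparison.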

As guessed in #99, there _are_ differences between setting `"index": "not_analyzed"` for a field and merely setting the analyzer to `keyword`. They are detailed in the Elasticsearch 2.4 [String datatype](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/string.html#string-params) documentation, although it's a little bit confusing.

In Elasticsearch 5+, there are _two_ different string datatypes:

- [`text`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/text.html) and
- [`keyword`](https://www.elastic.co/guide/en/elasticsearch/reference/6.4/keyword.html).

These documentation pages make the difference much clearer.

In short, in Elasticsearch 2.4, setting `"index": "not_analyzed"` gives the following changes, all of which we'd like for these literal fields:

- Analysis is skipped altogether; the raw value is added to the index directly (this is pretty much equivalent to setting `analyzer: keyword`)
- [norms](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/norms.html) are disabled for the field, saving some disk space
- [doc_values](https://www.elastic.co/guide/en/elasticsearch/reference/2.4/doc-values.html) are _en_abled.

The last one is most interesting. In short, doc_values take up a little disk space but allow us to very efficiently perform aggregations.

Pelias doesn't generally perform aggregations today. However, after we begin using a [single mapping type](#293), we will have no way for the [pelias dashboard](https://github.com/pelias/dashboard) or any of our own analysis scripts to provide document counts for different sources or layers. The dashboard currently uses an API to get the count of various mapping types, which won't be supported going forward. While minor, we needed a solution to this, and the only other one is fielddata, which is extremely expensive in terms of memory usage.

In my testing, for the Portland metro Docker project, disk usage went from 451MB to 473MB, or about a 5% increase.

If we wanted to trim that down a bit, we could consider disabling `doc_values` for the `parent.*_id` fields. We don't have an immediate need for `doc_values` on those fields, although it might be interesting for analysis.

Summary
------

While not technically required for [Elasticsearch 5 support](pelias/pelias#461), this PR does bring us more in line with the best practices of ES5. It also sets us up for [Elasticsearch 6](pelias/pelias#719) where the `string` datatype we use now is completely removed.

Fixes #99
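The per-source and per-layer document counts mentioned above could, for example, be fetched with terms aggregations, which read doc_values. The following is a minimal sketch under that assumption. The aggregation names, the `bucketsToCounts` helper, and the sample response numbers are all hypothetical.

```javascript
// Sketch only: count documents per source and per layer with terms
// aggregations. Aggregation names and the helper are hypothetical.
const countsBody = {
  size: 0, // no hits needed, only aggregation results
  aggs: {
    sources: { terms: { field: 'source' } },
    layers: { terms: { field: 'layer' } }
  }
};

// Turn the `buckets` array of a terms aggregation into { key: count }.
function bucketsToCounts(aggregation) {
  const counts = {};
  for (const bucket of aggregation.buckets) {
    counts[bucket.key] = bucket.doc_count;
  }
  return counts;
}

// Hand-written response fragment in the shape Elasticsearch returns for
// terms aggregations, with made-up numbers for illustration.
const sampleResponse = {
  aggregations: {
    sources: {
      buckets: [
        { key: 'openstreetmap', doc_count: 3 },
        { key: 'openaddresses', doc_count: 2 }
      ]
    },
    layers: {
      buckets: [
        { key: 'address', doc_count: 4 },
        { key: 'venue', doc_count: 1 }
      ]
    }
  }
};

const sourceCounts = bucketsToCounts(sampleResponse.aggregations.sources);
const layerCounts = bucketsToCounts(sampleResponse.aggregations.layers);
console.log(sourceCounts, layerCounts);
```

A terms aggregation returns only the top buckets by default, so a real query with many distinct values would need an explicit `size` inside `terms`.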

This PR disables doc_values for all fields except `source` and `layer`, which gives us about a 4% disk space savings. Merely changing the literal fields to use `not_analyzed` _increases_ disk usage by around 3%, so this is roughly a 7% win!
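A mapping that keeps doc_values only on `source` and `layer` might look roughly like the sketch below. The `literal` helper and the `source_id` entry are assumptions made for illustration and are not copied from the Pelias schema.

```javascript
// Sketch only: a helper for literal (exact-match) string fields in an
// Elasticsearch 2.x mapping. `literal` and the field list are hypothetical.
function literal(options = {}) {
  return {
    type: 'string',
    index: 'not_analyzed',
    // not_analyzed strings get doc_values by default, so they are
    // switched off here unless the caller explicitly asks for them.
    doc_values: options.docValues === true
  };
}

const properties = {
  source: literal({ docValues: true }), // needed for per-source counts
  layer: literal({ docValues: true }),  // needed for per-layer counts
  source_id: literal()                  // exact-match lookups only
};

console.log(JSON.stringify(properties, null, 2));
```

Making doc_values opt-in in the helper keeps the disk cost limited to the fields that are actually aggregated on.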

orangejulius added low priority experiment processed labels Jan 28, 2016

dianashk modified the milestone: Experiments Q2 Apr 19, 2016

orangejulius mentioned this issue Oct 23, 2018

feat(mapping): Remove store mapping parameter #329

Merged

orangejulius mentioned this issue Oct 25, 2018

feat(mapping): use "index": "not_analyzed" for literal fields #331

Merged

orangejulius closed this as completed in 2289253 Nov 3, 2018
