Commit
feat(schema): add beta ICU tokenizer
1 parent 72b4e2b, commit eefcaaa
Showing 10 changed files with 79 additions and 81 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,4 @@ | ||
node_modules | ||
npm-debug.log | ||
.DS_Store | ||
config-icu.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
const _ = require('lodash'); | ||
|
||
/** | ||
* This module contains modifications to the Pelias schema to adopt the elastic ICU tokenizer. | ||
* This tokenizer improves word-splitting of non-latin alphabets (particularly Asian languages). | ||
* | ||
* It can be enabled by setting `config.schema.icuTokenizer` in your `pelias.json` config. | ||
* Note: this must be set *before* you create your elasticsearch index or it will have no effect. | ||
* | ||
* This feature is considered beta, we encourage testing & feedback from the community in order | ||
* to adopt the ICU tokenizer as our default. | ||
* | ||
* https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html | ||
* https://github.com/pelias/schema/pull/498 | ||
*/ | ||
|
||
module.exports = (settings) => { | ||
|
||
// replace pattern tokenizer with icu_tokenizer | ||
_.set(settings, 'analysis.tokenizer.peliasTokenizer', { | ||
'type': 'icu_tokenizer' | ||
}); | ||
|
||
// add ampersand_replacer filter | ||
// replaces ampersand placeholders back to `&` (see `ampersand_mapper` char_filter) | ||
_.set(settings, 'analysis.filter.ampersand_replacer', { | ||
'type': 'pattern_replace', | ||
'pattern': 'AMPERSANDPLACEHOLDER', | ||
'replacement': '&' | ||
}); | ||
|
||
// add ampersand_mapper char_filter | ||
// icu_tokenizer treats ampersands as a word boundary, so we replace them with a placeholder before tokenization. | ||
// We want to handle them separately, so they are restored after tokenization (see `ampersand_replacer` filter). | ||
_.set(settings, 'analysis.char_filter.ampersand_mapper', { | ||
'type': 'pattern_replace', | ||
'pattern': '&', | ||
'replacement': ' AMPERSANDPLACEHOLDER ' | ||
}); | ||
|
||
// prepend ampersand mapper/replacer to each analyzer | ||
_.forEach(_.get(settings, 'analysis.analyzer'), (block) => { | ||
if (block?.tokenizer !== 'peliasTokenizer') { return; } | ||
block.filter.unshift('ampersand_replacer'); | ||
block.char_filter.unshift('ampersand_mapper'); | ||
}); | ||
|
||
return settings; | ||
} |
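To make the effect of the module above easier to follow, here is a minimal sketch that restates it in plain JavaScript (no lodash) and runs it against a tiny settings object. Several things here are assumptions for illustration only: the `applyIcuTokenizer` and `simulate` function names, the `exampleAnalyzer` and `otherAnalyzer` entries, the `punctuation` and `lowercase` filter names, and the boolean value given to `config.schema.icuTokenizer`. The whitespace split in `simulate` is only a stand-in for the tokenization step and does not reproduce what `icu_tokenizer` actually does.

```javascript
// Plain-JS restatement of the module in this diff, for illustration only.
function applyIcuTokenizer(settings) {
  const analysis = (settings.analysis = settings.analysis || {});
  analysis.tokenizer = analysis.tokenizer || {};
  analysis.filter = analysis.filter || {};
  analysis.char_filter = analysis.char_filter || {};

  // replace the tokenizer definition with icu_tokenizer
  analysis.tokenizer.peliasTokenizer = { type: 'icu_tokenizer' };

  // token filter: turns the placeholder back into `&` after tokenization
  analysis.filter.ampersand_replacer = {
    type: 'pattern_replace', pattern: 'AMPERSANDPLACEHOLDER', replacement: '&'
  };

  // char filter: swaps `&` for a placeholder before tokenization
  analysis.char_filter.ampersand_mapper = {
    type: 'pattern_replace', pattern: '&', replacement: ' AMPERSANDPLACEHOLDER '
  };

  // prepend both to every analyzer that uses peliasTokenizer
  for (const block of Object.values(analysis.analyzer || {})) {
    if (!block || block.tokenizer !== 'peliasTokenizer') { continue; }
    block.filter.unshift('ampersand_replacer');
    block.char_filter.unshift('ampersand_mapper');
  }
  return settings;
}

// Hypothetical stand-in for a parsed pelias.json; the boolean value is an assumption.
const config = { schema: { icuTokenizer: true } };

// Hypothetical, heavily trimmed settings object; analyzer and filter names are made up.
const base = {
  analysis: {
    analyzer: {
      exampleAnalyzer: { tokenizer: 'peliasTokenizer', char_filter: ['punctuation'], filter: ['lowercase'] },
      otherAnalyzer: { tokenizer: 'keyword', char_filter: [], filter: [] }
    }
  }
};
const settings = config.schema.icuTokenizer ? applyIcuTokenizer(base) : base;

// Stand-in for the analysis chain: char_filter -> tokenizer -> token filter.
// Splitting on whitespace is NOT ICU tokenization; it only shows the placeholder round trip.
function simulate(text) {
  const mapper = settings.analysis.char_filter.ampersand_mapper;
  const replacer = settings.analysis.filter.ampersand_replacer;
  const mapped = text.replace(new RegExp(mapper.pattern, 'g'), mapper.replacement);
  const rawTokens = mapped.split(/\s+/).filter(Boolean);
  return rawTokens.map((t) => t.replace(new RegExp(replacer.pattern, 'g'), replacer.replacement));
}

const tokens = simulate('Johnson&Johnson');
console.log(settings.analysis.analyzer.exampleAnalyzer);
console.log(tokens);
```

In this sketch only `exampleAnalyzer` is modified, because it is the only entry whose `tokenizer` is `peliasTokenizer`. The ampersand comes out of `simulate` as its own token, which matches the intent described in the code comments.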
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3056,4 +3056,4 @@ | |
}, | ||
"dynamic": "strict" | ||
} | ||
} | ||
} |