feat(schema): add beta ICU tokenizer
* Use ICU tokenizer to improve some Asian languages support
* Remove unused import
* Add more chinese test cases
* Add icuTokenizer flag
* Implement ICU tokenizer test
* Run unit tests for both ICU = true/false
* Run tests for both ICU = true/false
* add fixtures
* Fix bug in settings
* Fix tests
* Fix tests
* Fix tests
* Fix tests
* feat(schema): add beta ICU tokenizer

Co-authored-by: Peter Johnson <insomnia@rcpt.at>
Parent: 41bd2d1, commit: 1098354
Showing 13 changed files with 3,272 additions and 23 deletions.
.gitignore

```diff
@@ -1,3 +1,4 @@
 node_modules
 npm-debug.log
 .DS_Store
+config-icu.json
```
New file:

```js
const _ = require('lodash');

/**
 * This module contains modifications to the Pelias schema to adopt the elastic ICU tokenizer.
 * This tokenizer improves word-splitting of non-latin alphabets (particularly Asian languages).
 *
 * It can be enabled by setting `config.schema.icuTokenizer` in your `pelias.json` config.
 * Note: this must be set *before* you create your elasticsearch index or it will have no effect.
 *
 * This feature is considered beta; we encourage testing & feedback from the community in order
 * to adopt the ICU tokenizer as our default.
 *
 * https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html
 * https://github.com/pelias/schema/pull/498
 */

module.exports = (settings) => {

  // replace pattern tokenizer with icu_tokenizer
  _.set(settings, 'analysis.tokenizer.peliasTokenizer', {
    'type': 'icu_tokenizer'
  });

  // add ampersand_replacer filter
  // replaces ampersand placeholders back to `&` (see `ampersand_mapper` char_filter)
  _.set(settings, 'analysis.filter.ampersand_replacer', {
    'type': 'pattern_replace',
    'pattern': 'AMPERSANDPLACEHOLDER',
    'replacement': '&'
  });

  // add ampersand_mapper char_filter
  // icu_tokenizer treats ampersands as a word boundary, so we replace them with a placeholder
  // to avoid that; since we want to handle them separately, we replace them back after
  // tokenization (see `ampersand_replacer` filter)
  _.set(settings, 'analysis.char_filter.ampersand_mapper', {
    'type': 'pattern_replace',
    'pattern': '&',
    'replacement': ' AMPERSANDPLACEHOLDER '
  });

  // prepend ampersand mapper/replacer to each analyzer
  _.forEach(_.get(settings, 'analysis.analyzer'), (block) => {
    if (block?.tokenizer !== 'peliasTokenizer') { return; }
    block.filter.unshift('ampersand_replacer');
    block.char_filter.unshift('ampersand_mapper');
  });

  return settings;
}
```
New file:

```json
{
  "elasticsearch": {
    "settings": {
      "index": {
        "number_of_replicas": "999",
        "number_of_shards": "5",
        "refresh_interval": "1m"
      }
    }
  },
  "schema": {
    "icuTokenizer": true
  }
}
```
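Since the `icuTokenizer` flag lives under `schema` in `pelias.json`, the schema build can gate the ICU modifications on it. A minimal sketch of such gating follows; the function names and wiring here are assumptions for illustration, not the actual pelias/schema code:

```javascript
// Hypothetical wiring: apply the ICU modifications only when the beta flag is enabled.
const config = {
  schema: { icuTokenizer: true }  // mirrors the pelias.json fragment above
};

// Stand-in for the module shown earlier (only the tokenizer swap, for brevity).
const applyIcu = (settings) => {
  settings.analysis.tokenizer.peliasTokenizer = { type: 'icu_tokenizer' };
  return settings;
};

function buildSettings(config, baseSettings, applyIcu) {
  // Leave the settings untouched unless config.schema.icuTokenizer is set.
  return config.schema && config.schema.icuTokenizer
    ? applyIcu(baseSettings)
    : baseSettings;
}

const base = { analysis: { tokenizer: { peliasTokenizer: { type: 'pattern' } } } };
const result = buildSettings(config, base, applyIcu);
console.log(result.analysis.tokenizer.peliasTokenizer.type);
// icu_tokenizer
```

As the module's doc comment notes, flipping this flag only matters before the elasticsearch index is created; changing it afterwards has no effect until the index is rebuilt.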