Use ICU tokenizer to improve some Asian languages support #498
Merged: missinglink merged 14 commits into pelias:master from SiarheiFedartsou:sf-icu-tokenizer3 on Feb 4, 2025
Commits:
5074b6b Use ICU tokenizer to improve some Asian languages support (SiarheiFedartsou)
a91418e Remove unused import (SiarheiFedartsou)
6e8feea Add more chinese test cases (SiarheiFedartsou)
7902972 Add icuTokenizer flag (SiarheiFedartsou)
f8f844c Implement ICU tokenizer test (SiarheiFedartsou)
0275458 Run unit tests for both ICU = true/false (SiarheiFedartsou)
1384119 Run tests for both ICU = true/false (SiarheiFedartsou)
cc3765f add fixtures (SiarheiFedartsou)
3bc1581 Fix bug in settings (SiarheiFedartsou)
705500c Fix tests (SiarheiFedartsou)
707714a Fix tests (SiarheiFedartsou)
bb52577 Fix tests (SiarheiFedartsou)
72b4e2b Fix tests (SiarheiFedartsou)
eefcaaa feat(schema): add beta ICU tokenizer (missinglink)
@@ -1,3 +1,4 @@
 node_modules
 npm-debug.log
 .DS_Store
+config-icu.json
@@ -0,0 +1,49 @@
const _ = require('lodash');

/**
 * This module contains modifications to the Pelias schema to adopt the Elasticsearch ICU tokenizer.
 * This tokenizer improves word-splitting of non-Latin scripts (particularly Asian languages).
 *
 * It can be enabled by setting `config.schema.icuTokenizer` in your `pelias.json` config.
 * Note: this must be set *before* you create your elasticsearch index or it will have no effect.
 *
 * This feature is considered beta; we encourage testing & feedback from the community in order
 * to adopt the ICU tokenizer as our default.
 *
 * https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-icu-tokenizer.html
 * https://github.com/pelias/schema/pull/498
 */

module.exports = (settings) => {

  // replace pattern tokenizer with icu_tokenizer
  _.set(settings, 'analysis.tokenizer.peliasTokenizer', {
    'type': 'icu_tokenizer'
  });

  // add ampersand_replacer filter
  // turns ampersand placeholders back into `&` (see `ampersand_mapper` char_filter)
  _.set(settings, 'analysis.filter.ampersand_replacer', {
    'type': 'pattern_replace',
    'pattern': 'AMPERSANDPLACEHOLDER',
    'replacement': '&'
  });

  // add ampersand_mapper char_filter
  // icu_tokenizer treats ampersands as a word boundary; to handle them separately, we replace each `&`
  // with a placeholder before tokenization and restore it afterwards (see `ampersand_replacer` filter)
  _.set(settings, 'analysis.char_filter.ampersand_mapper', {
    'type': 'pattern_replace',
    'pattern': '&',
    'replacement': ' AMPERSANDPLACEHOLDER '
  });

  // prepend ampersand mapper/replacer to each analyzer that uses peliasTokenizer
  _.forEach(_.get(settings, 'analysis.analyzer'), (block) => {
    if (block?.tokenizer !== 'peliasTokenizer') { return; }
    block.filter.unshift('ampersand_replacer');
    block.char_filter.unshift('ampersand_mapper');
  });

  return settings;
}
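
The placeholder round trip described in the comments above can be illustrated outside Elasticsearch. The following is a minimal sketch, not the plugin's behaviour: `standInTokenize` is a hypothetical stand-in that splits on anything that is not a letter or digit, which is only an approximation of how `icu_tokenizer` treats `&`. The function names are made up for this illustration.

```javascript
// Sketch only: `standInTokenize` is a hypothetical stand-in for icu_tokenizer.
// It splits on any run of characters that are neither letters nor digits.
const PLACEHOLDER = 'AMPERSANDPLACEHOLDER';

// char_filter step (runs before tokenization): '&' -> ' AMPERSANDPLACEHOLDER '
const ampersandMapper = (text) => text.replace(/&/g, ` ${PLACEHOLDER} `);

// stand-in tokenizer: punctuation such as '&' never becomes a token
const standInTokenize = (text) => text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);

// token filter step (runs after tokenization): placeholder -> '&'
const ampersandReplacer = (token) => token.replace(PLACEHOLDER, '&');

// full chain: char_filter, then tokenizer, then token filter
const analyze = (text) => standInTokenize(ampersandMapper(text)).map(ampersandReplacer);

console.log(standInTokenize('Johnson & Johnson')); // the ampersand is lost
console.log(analyze('Johnson & Johnson'));         // the ampersand survives as its own token
```

The spaces around the placeholder in the mapper ensure that input such as `A&B` still yields three separate tokens.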
@@ -0,0 +1,15 @@
{
  "elasticsearch": {
    "settings": {
      "index": {
        "number_of_replicas": "999",
        "number_of_shards": "5",
        "refresh_interval": "1m"
      }
    }
  },
  "schema": {
    "icuTokenizer": true
  }
}
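
How the schema code reads this flag is not shown in this excerpt. As a rough sketch under that caveat, the flag could gate the ICU changes as below. `applyIcuWiring`, `buildSettings`, `makeSettings` and the analyzer names are all hypothetical, and the wiring function is a lodash-free imitation of only the last step of the module in this PR.

```javascript
// Hypothetical, lodash-free imitation of the analyzer wiring step in this PR.
// It does not define the tokenizer or the two ampersand filters themselves.
const applyIcuWiring = (settings) => {
  for (const analyzer of Object.values(settings.analysis.analyzer)) {
    if (analyzer.tokenizer !== 'peliasTokenizer') { continue; }
    analyzer.filter.unshift('ampersand_replacer');
    analyzer.char_filter.unshift('ampersand_mapper');
  }
  return settings;
};

// Hypothetical helper: apply the ICU changes only when the flag is set.
const buildSettings = (config, settings) =>
  config?.schema?.icuTokenizer === true ? applyIcuWiring(settings) : settings;

// Made-up analyzer definitions, for illustration only.
const makeSettings = () => ({
  analysis: {
    analyzer: {
      exampleText: { tokenizer: 'peliasTokenizer', char_filter: ['punctuation'], filter: ['lowercase'] },
      exampleKeyword: { tokenizer: 'keyword', char_filter: [], filter: ['lowercase'] }
    }
  }
});

const enabled = buildSettings({ schema: { icuTokenizer: true } }, makeSettings());
const disabled = buildSettings({ schema: {} }, makeSettings());

console.log(enabled.analysis.analyzer.exampleText);  // mapper/replacer prepended
console.log(disabled.analysis.analyzer.exampleText); // unchanged
```

Only analyzers built on `peliasTokenizer` are touched, so the keyword-style analyzer in this sketch keeps its original filter chain in both cases.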
The correct splitting is:
北京市 - Beijing city
朝阳区 - The district
东三环中路 - East 3rd Ring Middle Road
1号 - Road number
国际大厦 - Building name
a座 - Block number
1001室 - Room number
Full Chinese addresses usually follow the pattern ...省 ...市 ...区 ...路 ...号 building_name ...座 ...楼 ...室 (province, city, district, road, street number, building name, block, floor, room).
BTW, if this is the token-level output, then it makes sense. I feel very short tokens like '东', '三', '环', '中路' may be too small for search, but it’s not wrong.
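
The component-level grouping described in the review comments above can be sketched with a suffix-based split. This is a hypothetical helper, not part of the PR and not how `icu_tokenizer` segments text (ICU uses dictionary-based word breaking for Chinese). The suffix list is an assumption built from the pattern quoted in the comments, with `大厦` (building) added so the building name forms its own component. The input string is assembled by concatenating the components the reviewer listed.

```javascript
// Hypothetical helper: split a Chinese address after each known component suffix.
// SUFFIXES is an assumed, incomplete list; '大厦' is an addition for this example.
const SUFFIXES = ['省', '市', '区', '路', '号', '大厦', '座', '楼', '室'];

// Lazily match the shortest run of characters that ends in one of the suffixes.
const COMPONENT = new RegExp(`.+?(?:${SUFFIXES.join('|')})`, 'gu');

const splitAddress = (address) => address.match(COMPONENT) || [];

console.log(splitAddress('北京市朝阳区东三环中路1号国际大厦a座1001室'));
// → [ '北京市', '朝阳区', '东三环中路', '1号', '国际大厦', 'a座', '1001室' ]
```

A split like this is brittle, for example when a place name itself contains a suffix character, which is one reason a dictionary-based tokenizer is the more general approach even if its tokens are smaller than full address components.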