Use ICU tokenizer to improve some Asian languages support #498

SiarheiFedartsou · 2024-11-26T17:53:19Z

👋

Disclaimer: I am not native to Thai, Chinese, Japanese or any languages I am adding support for here, but this change still seems to be reasonable.

Here's the reason for this change 🚀

At the moment we use simple regex based tokenizer which simply splits strings using whitespaces and dashes as separator ([\s,/\\-]+). While it perfectly works for most of Western languages it fails in some Asian ones. Mainly it is because they don't actually use spaces to separate words (!). This change should address this via using icu_tokenizer which uses different complicated technics to solve it.

Worth saying that this solution is still not perfect, e.g. for Thai there is separate thai_tokenizer (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-thai-tokenizer.html), but icu one still works better than what we are using right now...

Here's what actually got changed 👏

Changed tokenizer to icu_tokenizer
Added workaround for ampersand - icu_tokenizer filters it out...

Here's how others can test the changes 👀

I added a couple of integration tests with examples of queries which were tokenized as single token in the past.

SiarheiFedartsou · 2024-11-26T17:56:49Z

@missinglink please take a look when you'll have a chance 🙏🏻
Also do you think there are other places where we would need to make similar changes to have better support for such languages?

missinglink

This would be an interesting change, I'd like to continue investigation.

It will need significant testing since this is an absolutely core function of the system, even small regressions here will be hugely impactful.

It's promising that the integration tests all still pass, pointing to a possibility that this can be implemented in a backwards-compatible way.

I've added a few comments in the review.

Mostly I'd like to better understand the differences between the two tokenizers, any potential regressions/changes to the existing method, and enumerate the benefits of this work.

As it's a major change, it would be great to list at least a few queries which would improve with this change so other developers understand the value.

integration/analyzer_peliasStreet.js

missinglink · 2024-11-27T11:14:00Z

integration/analyzer_peliasQuery.js

@@ -49,6 +49,16 @@ module.exports.tests.functional = function(test, common){
    assertAnalysis( 'place', 'Toys "R" Us!', [ 'toys', 'r', 'us' ]);
    assertAnalysis( 'address', '101 mapzen place', [ '101', 'mapzen', 'place' ]);

+     // complicated tokenization for some Asian languages
+     assertAnalysis('thai_address1', 'ซอยเพชรบุรี๑', ['ซอย', 'เพชรบุรี1'] );


Can you explain a little more about the segmentation here?

Are these token groups from research/knowledge or just a copy->paste of what was generated?

The number 1 on the end catches my eye and I'm curious how it appeared from ซอยเพชรบุรี๑

'ซอย', 'เพชรบุรี1'

Are these token groups from research/knowledge or just a copy->paste of what was generated?

Well, I used ChatGPT to generate them and then tools like Google Translate to validate, that these tokens indeed have separate meanings... Tbh it would be great to have someone native to one of those languages to validate 🤔 I will try to find one 😄

The number 1 on the end catches my eye and I'm curious how it appeared from ซอยเพชรบุรี๑

Thai language has its own digits (but they use Arabic ones as well btw) and ๑ is 1. Btw we convert them already even without this tokenizer. Not sure at which step though. But good question is if this number should be a separate token, but IIRC it is how it works right now(if you input foo1 it will generate single foo1 token), so I decided to limit scope of changes and may be will address it with separate PR. WDYT?

and ๑ is 1.

This makes sense, it's not an issue, normalizing to Arabic numerals is fine.

and yeah, we should get a human to check a few examples, maybe best to generate a few more before asking someone in different classes such as address, venue(s), streets, intersections, localities etc.

this feature will go through a bunch of testing which might take a long time, flagging potential issues now will be much easier than finding them later, I'm fairly confident we can resolve any issues.

missinglink · 2024-11-27T11:17:02Z

settings.js

@@ -22,16 +22,16 @@ function generate(){
    "analysis": {
      "tokenizer": {
        "peliasTokenizer": {
-          "type": "pattern",
-          "pattern": "[\\s,/\\\\-]+"


The previous pattern matched on [whitespace, comma, forward/bash slashes and hyphen], is this still the case with the icu_tokenizer?

Good point, I'll try to check if we have relevant tests and if we don't, will write them 👍

missinglink · 2024-11-27T11:18:34Z

settings.js

@@ -175,6 +181,12 @@ function generate(){
        },
      },
      "filter" : {


I wish this AMPERSANDPLACEHOLDER replacement wasn't necessary but understand why it is.

missinglink · 2024-11-27T11:29:38Z

Possibly stupid question, but does Elastic support the regex word boundary character \b and is that unicode-aware?

SiarheiFedartsou · 2024-11-27T12:15:07Z

@missinglink thanks for the review! I'll come with more answers to your questions later.

kenil-cheng · 2024-11-27T16:06:03Z

integration/analyzer_peliasQuery.js

+     assertAnalysis('thai_address2', 'ซอยเพชรบุรี๑foo', ['ซอย', 'เพชรบุรี1', 'foo'] );
+     assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);
+     assertAnalysis('chinese_address', '北京市朝阳区东三环中路1号国际大厦A座1001室', 
+       ['北京市', '朝阳', '区', '东', '三', '环', '中路', '1', '号', '国际', '大厦', 'a', '座', '1001', '室']);


The correct splitting is:
北京市 - Beijing city
朝阳区 - The district
东三环中路 - East 3rd Ring Middle Road
1号 - Road number
国际大厦 - Building name
a座 - Block number
1001室 - Room number

The full Chinese addresses are usually ...省 ...市 ...区 ...路 ...号 building_name ...座 ...楼 ...室

BTW, if it’s by tokens then they make sense. I feel single characters like '东', '三', '环', '中路' may be too small for search, but it’s not wrong.

kenil-cheng · 2024-11-27T16:10:10Z

integration/analyzer_peliasQuery.js

+     assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);
+     assertAnalysis('chinese_address', '北京市朝阳区东三环中路1号国际大厦A座1001室', 
+       ['北京市', '朝阳', '区', '东', '三', '环', '中路', '1', '号', '国际', '大厦', 'a', '座', '1001', '室']);
+     assertAnalysis('japanese_address', '東京都渋谷区渋谷２丁目２１−１渋谷スクランブルスクエア４階', ["東京", "都", "渋谷", "区", "渋谷", "2", "丁目", "21", "1", "渋谷", "スクランフル", "スクエア", "4", "階"]);


The correct splitting is:
東京都 - Tokyo city
渋谷区 - District
渋谷2 丁目 - Street
211 - Street number
渋谷スクランフルスクエア - Building: Shibuya Scramble Square
4 階 - Floor

missinglink · 2024-11-27T16:45:15Z

Thank you @kenil-cheng for your time.

SiarheiFedartsou · 2024-11-27T17:07:05Z

Possibly stupid question, but does Elastic support the regex word boundary character \b and is that unicode-aware?

I've tried to simply extend original regex with it and got error (smth about it doesn't know what \b is) 😞

dzviagin · 2024-11-28T08:06:33Z

integration/analyzer_peliasStreet.js

+    // complicated tokenization for some Asian languages
+    assertAnalysis('thai_address1', 'ซอยเพชรบุรี๑', ['ซอย', 'เพชรบุรี1'] );
+    assertAnalysis('thai_address2', 'ซอยเพชรบุรี๑foo', ['ซอย', 'เพชรบุรี1', 'foo'] );
+    assertAnalysis('thai_address3', 'บ้านเลขที่๑๒๓ถนนสุขุมวิทแขวงคลองตันเหนือเขตวัฒนา กรุงเทพมหานคร๑๐๑๑๐', ["บาน", "เลข", "ที123ถนน", "สุขุมวิท", "แขวง", "คลองตัน", "เหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร10110"]);


My Thai mate suggested

["บ้าน", "เลขที่", "123", "ถนน", "สุขุมวิท", "แขวง", "คลองตันเหนือ", "เขต", "วัฒนา", "กรุงเทพมหานคร", "10110",]

apart from different split, it's visible that first letter was changed from บ้ to บ in your case.

Thanks @dzviagin ! I will take a look, likely it is a result of removal of diacritic symbols, which existed even before this PR...

SiarheiFedartsou · 2025-02-01T09:37:56Z

Hey @missinglink WDYT we would need to do in order to eventually get it merged?

missinglink · 2025-02-01T12:17:19Z

Hi @SiarheiFedartsou

This is a significant change to probably the most core function in the system, which is always going to be a difficult thing to merge confidentiality.

I see a few paths to adoption:

Merge it as-is and hope nothing breaks.
Invest some significant time testing it against global datasets and queries, then merge it.
Merge it behind a beta flag, encourage adoption and iterate based on feedback to a stable release.
Keep it in a fork and encourage adoption, as above.

We need to be very careful to consider all the current users across business/local govt etc who will be impacted by the change (positively or negatively).

For this reason I feel like 1. is potentially a bit reckless, maybe you can assure me it's not going to affect existing Latin queries?

Option 2. is a good idea but requires potentially a lot of time invested by the core team to spin up global builds and test them, a build generally takes most of a day, testing also takes time, iterating on this process takes a lot of time.

Option 3. is the easiest solution to getting the code into the master branch, this would allow users to test it out and provide feedback. The issue being that it's going to be enabled by a small percentage of users, although it would also make 2. easier for us to automate. The code would potentially be messier during the beta period (code duplication).

One clean way of achieving this is to have a new pelias/config variable which is read when the index is created, call it icuTokenizer: boolean or similar which tells this lib to use the icu tokenizer code path.

Option 4. is much like 3 except it's even less likely that users will find the fork and use it as a replacement for the master branch, reducing the amount of feedback we receive.

It seems to me that 3. would be something we could accept immediately and would make 2. easier to automate.

Please let me know your thoughts and if I've missed something, or if you think I'm overstating any potential regressions.

cc/ @orangejulius @Joxit for their opinions

SiarheiFedartsou · 2025-02-01T13:01:21Z

@missinglink yes, I agree. While I am quite sure it won't affect existing Latin queries from business logic perspective, I am not sure in other perspectives: e.g. I would expect this tokenizer to be slower and it is not clear for how much.

Okay, I'll probably try to invest some time to investigate how this change can be made optional(i.e. option 3). Thanks!

missinglink · 2025-02-01T15:30:35Z

Regarding the performance impact:

I suspect that the indexing performance to be negligible yet considering the scale of 500+ million documents it could become an issue, let's keep an eye on this but it's likely not a big deal to add 1-2% to build times considering the benefits.

For queries (where we also run the tokenizer) I also suspect it will be negligible, but yeah let's keep an eye on that too, particularly for autocomplete which is latency sensitive.

We (Geocode Earth) have the ability to split test our deployments and measure latency differences between API traffic sent to each. It's something we'd do once we were sure the quality wasn't affected but would allow us to scientifically measure the latency impact on the scale of tens of millions of queries.

SiarheiFedartsou · 2025-02-02T10:29:07Z

@missinglink may I ask you to take a look again please? :)

What I have now:

I added icuTokenizer flag to config (I assume it is okay to simply add it to configValidation and don't update https://github.com/pelias/config ?)
I rely on this flag in settings.js to generate proper schema. I tried to avoid copy-paste as much as possible.
I extended CI, so that we run tests for both icuTokenizer = true/false
I updated tests, so that for some tests we have separate branches for icuTokenizer = true/false

missinglink · 2025-02-03T14:17:06Z

Looks good to me:

/code/p/schema pull/498/merge = *1 ❯ 2>&1 npm run integration | grep -i thai
    ✔ thai_address
    ✔ thai_address1
    ✔ thai_address1

/code/p/schema pull/498/merge = *1 ❯ jq -n '{ schema: { icuTokenizer: true } }' > $(pwd)/config-icu.json
/code/p/schema pull/498/merge = *1 ?1 ❯ export PELIAS_CONFIG=$(pwd)/config-icu.json
/code/p/schema pull/498/merge = *1 ?1 ❯ 2>&1 npm run integration | grep -i thai
    ✔ thai_address
    ✔ thai_address1
    ✔ thai_address2
    ✔ thai_address3
    ✔ thai_address1
    ✔ thai_address2
    ✔ thai_address3

missinglink · 2025-02-03T14:18:04Z

Does this PR include #501 or do we need to merge that seperately?

SiarheiFedartsou · 2025-02-03T14:37:51Z

Does this PR include #501 or do we need to merge that seperately?

It doesn't. Let's may be merge it separately?

missinglink · 2025-02-03T14:38:08Z

I really wish we didn't need ampersand_mapper and ampersand_replacer, removing them would make this much cleaner, unfortunately I don't see any way of achieving this 🤷

I'm happy to merge this tomorrow unless we get any additional feedback from the linked contributors before then.

missinglink · 2025-02-04T11:57:25Z

Thanks for contributing this.
I added eefcaaa to do a little cleanup before merging 🙇

orangejulius · 2025-02-06T17:51:22Z

Hey sorry, I'm late to this PR but I'd like to request some changes.

We've extended considerable effort to ensure that the Github actions templates across all our Pelias repos are as identical as possible. The schema repository is one that isn't quite identical yet, but would it be possible to adjust the testing done in this PR so it does not work against that? What that means is we keep any logic of how to run the tests and what that means out of the GitHub Actions templates and inside whatever is run when we call npm run tests.

I'm also a little concerned about having completely duplicate fixtures and expected test values. Those file are already a bit of a pain to keep up to date, and now we'll have two of them.

Do we think that having multiple tokenizers is something we would want to have in the Pelias codebase long term? If not, let's just keep this on a branch and test it until it's better than what we currently have?

SiarheiFedartsou force-pushed the sf-icu-tokenizer3 branch from e97e8e9 to 8c2712b Compare November 26, 2024 17:55

SiarheiFedartsou marked this pull request as ready for review November 26, 2024 17:55

missinglink reviewed Nov 27, 2024

View reviewed changes

kenil-cheng reviewed Nov 27, 2024

View reviewed changes

dzviagin reviewed Nov 28, 2024

View reviewed changes

SiarheiFedartsou mentioned this pull request Dec 7, 2024

Add some more test cases for tokenization and ascii folding #501

Merged

SiarheiFedartsou added 7 commits February 2, 2025 11:07

Use ICU tokenizer to improve some Asian languages support

5074b6b

Remove unused import

a91418e

Add more chinese test cases

6e8feea

Add icuTokenizer flag

7902972

Implement ICU tokenizer test

f8f844c

Run unit tests for both ICU = true/false

0275458

Run tests for both ICU = true/false

1384119

SiarheiFedartsou force-pushed the sf-icu-tokenizer3 branch from b1bc083 to 1384119 Compare February 2, 2025 10:08

SiarheiFedartsou added 6 commits February 2, 2025 11:09

add fixtures

cc3765f

Fix bug in settings

3bc1581

Fix tests

705500c

Fix tests

707714a

Fix tests

bb52577

Fix tests

72b4e2b

SiarheiFedartsou requested a review from missinglink February 2, 2025 10:29

missinglink force-pushed the sf-icu-tokenizer3 branch 2 times, most recently from 80d483c to 54f9b5b Compare February 4, 2025 11:51

feat(schema): add beta ICU tokenizer

eefcaaa

missinglink force-pushed the sf-icu-tokenizer3 branch from 54f9b5b to eefcaaa Compare February 4, 2025 11:53

missinglink merged commit 1098354 into pelias:master Feb 4, 2025
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use ICU tokenizer to improve some Asian languages support #498

Use ICU tokenizer to improve some Asian languages support #498

SiarheiFedartsou commented Nov 26, 2024 •

edited

Loading

SiarheiFedartsou commented Nov 26, 2024

missinglink left a comment

missinglink Nov 27, 2024

SiarheiFedartsou Nov 27, 2024

missinglink Nov 27, 2024

missinglink Nov 27, 2024 •

edited

Loading

missinglink Nov 27, 2024 •

edited

Loading

missinglink Nov 27, 2024

SiarheiFedartsou Nov 27, 2024

missinglink Nov 27, 2024

missinglink commented Nov 27, 2024 •

edited

Loading

SiarheiFedartsou commented Nov 27, 2024

kenil-cheng Nov 27, 2024 •

edited

Loading

kenil-cheng Nov 27, 2024

kenil-cheng Nov 27, 2024

kenil-cheng Nov 27, 2024

missinglink commented Nov 27, 2024

SiarheiFedartsou commented Nov 27, 2024 •

edited

Loading

dzviagin Nov 28, 2024 •

edited

Loading

SiarheiFedartsou Nov 28, 2024

SiarheiFedartsou commented Feb 1, 2025

missinglink commented Feb 1, 2025

SiarheiFedartsou commented Feb 1, 2025 •

edited

Loading

missinglink commented Feb 1, 2025

SiarheiFedartsou commented Feb 2, 2025 •

edited

Loading

missinglink commented Feb 3, 2025 •

edited

Loading

missinglink commented Feb 3, 2025

SiarheiFedartsou commented Feb 3, 2025

missinglink commented Feb 3, 2025

missinglink commented Feb 4, 2025

orangejulius commented Feb 6, 2025 •

edited

Loading

Use ICU tokenizer to improve some Asian languages support #498

Use ICU tokenizer to improve some Asian languages support #498

Conversation

SiarheiFedartsou commented Nov 26, 2024 • edited Loading

Here's the reason for this change 🚀

Here's what actually got changed 👏

Here's how others can test the changes 👀

SiarheiFedartsou commented Nov 26, 2024

missinglink left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

missinglink Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

missinglink Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

missinglink commented Nov 27, 2024 • edited Loading

SiarheiFedartsou commented Nov 27, 2024

kenil-cheng Nov 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

missinglink commented Nov 27, 2024

SiarheiFedartsou commented Nov 27, 2024 • edited Loading

dzviagin Nov 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

SiarheiFedartsou commented Feb 1, 2025

missinglink commented Feb 1, 2025

SiarheiFedartsou commented Feb 1, 2025 • edited Loading

missinglink commented Feb 1, 2025

SiarheiFedartsou commented Feb 2, 2025 • edited Loading

missinglink commented Feb 3, 2025 • edited Loading

missinglink commented Feb 3, 2025

SiarheiFedartsou commented Feb 3, 2025

missinglink commented Feb 3, 2025

missinglink commented Feb 4, 2025

orangejulius commented Feb 6, 2025 • edited Loading

SiarheiFedartsou commented Nov 26, 2024 •

edited

Loading

missinglink Nov 27, 2024 •

edited

Loading

missinglink Nov 27, 2024 •

edited

Loading

missinglink commented Nov 27, 2024 •

edited

Loading

kenil-cheng Nov 27, 2024 •

edited

Loading

SiarheiFedartsou commented Nov 27, 2024 •

edited

Loading

dzviagin Nov 28, 2024 •

edited

Loading

SiarheiFedartsou commented Feb 1, 2025 •

edited

Loading

SiarheiFedartsou commented Feb 2, 2025 •

edited

Loading

missinglink commented Feb 3, 2025 •

edited

Loading

orangejulius commented Feb 6, 2025 •

edited

Loading