Add some more test cases for tokenization and ascii folding #501

SiarheiFedartsou · 2024-12-07T12:58:25Z

👋 I did some awesome work for the Pelias project and would love for everyone to have a look at it and provide feedback.

Here's the reason for this change 🚀

There are some open questions in the discussion of PR #498, so I'd like to propose extending the existing test harness a bit to document the current state of things. The current schema:

tokenizes on whitespace, hyphens, and slashes (but not on the "dash" character, for example; I'd expect the ICU tokenizer to behave differently here, by the way 🤔)
normalizes Thai digits to Arabic ones (I think we can extrapolate this to other digit systems)
removes tone marks in Thai script
never splits digits "glued" to the end of a word into a separate token
doesn't properly tokenize Asian languages that don't use whitespace
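
To make the list above concrete, here is a rough sketch of that behaviour as a toy model in plain JavaScript. It is not the Pelias schema or the Elasticsearch analyzer: the function names (`foldThaiDigits`, `stripThaiToneMarks`, `toyTokenize`) are hypothetical, "slashes" is assumed to mean the forward slash only, and "tone marks" is assumed to mean the four Thai tone marks U+0E48 to U+0E4B.

```javascript
// Toy model of the behaviour described above. This is an illustration only,
// NOT the actual Pelias/Elasticsearch analysis chain.

// Thai digits occupy U+0E50..U+0E59, so the offset from U+0E50 is the digit value.
function foldThaiDigits(text) {
  return text.replace(/[\u0E50-\u0E59]/g, (ch) =>
    String(ch.codePointAt(0) - 0x0e50)
  );
}

// Assumption: "tone marks" means mai ek, mai tho, mai tri and mai chattawa
// (U+0E48..U+0E4B).
function stripThaiToneMarks(text) {
  return text.replace(/[\u0E48-\u0E4B]/g, '');
}

// Split on whitespace, ASCII hyphen-minus and forward slash only.
// Other dash characters (e.g. en dash U+2013) are NOT split points, trailing
// digits stay attached to their word, and text without whitespace remains
// a single token.
function toyTokenize(text) {
  return stripThaiToneMarks(foldThaiDigits(text))
    .split(/[\s\-\/]+/)
    .filter((token) => token.length > 0);
}

console.log(toyTokenize('main-street 12/4')); // [ 'main', 'street', '12', '4' ]
console.log(toyTokenize('a\u2013b'));         // [ 'a–b' ] (en dash does not split)
console.log(toyTokenize('\u0E51\u0E52\u0E53')); // [ '123' ]
console.log(toyTokenize('street22'));         // [ 'street22' ]
```

The real tests in this PR exercise the analyzers through Elasticsearch, so the sketch only mirrors the expectations they document.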

Here's what actually got changed 👏

Added tests

Here's how others can test the changes 👀

Run tests :)

orangejulius

Looks good to me, these will be helpful tests to have as we take a look at #498

missinglink · 2025-02-04T12:03:01Z

Hi @SiarheiFedartsou I just rebased origin/master after merging your other PR and these cases now fail due to differences between the two tokenizers.

Would you mind updating it to reflect the behaviour you were expecting from the icuTokenizer? 🙏

missinglink · 2025-02-04T12:09:50Z

Are these failures expected?

✖ thai_tonemarks
-----------------
  operator: deepEqual
  expected: |-
    { '@pos0': [ 'กกกกขขขขคคคคฆฆฆฆ' ] }
  actual: |-
    {}

✖ chinese_address
------------------
  operator: deepEqual
  expected: |-
    { '@pos0': [ '北京市朝阳区东三环中路1号国际大厦a座1001室' ] }
  actual: |-
    {}

SiarheiFedartsou · 2025-02-04T12:31:31Z

> Are these failures expected?

I doubt it :) I will look into it in the next few days. Most likely we should have different expected results for ICU and non-ICU...

SiarheiFedartsou · 2025-02-06T17:39:15Z

@missinglink done. PTAL again.

missinglink mentioned this pull request Feb 3, 2025

Use ICU tokenizer to improve some Asian languages support #498

Merged

orangejulius approved these changes Feb 3, 2025

missinglink force-pushed the sf-add-tests branch from f038644 to 964cc43 Compare February 4, 2025 11:58

SiarheiFedartsou added 3 commits February 6, 2025 18:13

Add some more test cases for tokenization and ascii folding

8fba241

Add some more test cases for tokenization and ascii folding

dd8c760

Fix failing test cases

133e17b

SiarheiFedartsou force-pushed the sf-add-tests branch from 964cc43 to 133e17b Compare February 6, 2025 17:27

missinglink merged commit 1319c20 into pelias:master Feb 7, 2025
12 checks passed
