Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some more test cases for tokenization and ascii folding #501

Merged
merged 3 commits into from
Feb 7, 2025

Conversation

SiarheiFedartsou
Copy link
Contributor

@SiarheiFedartsou SiarheiFedartsou commented Dec 7, 2024

👋 I did some awesome work for the Pelias project and would love for everyone to have a look at it and provide feedback.


Here's the reason for this change 🚀

There are some questions in discussion of this PR #498, so I'd like to propose to extend a bit existing test harness to kind of document current state of the things. So current schema:

  • tokenizes by whitespace, hyphen, slashes (but doesn't by "dash" character for example - I'd expect ICU tokenizer to work differently here btw 🤔 )
  • normalizes thai digits to Arabic ones (I think we can extrapolate it to any other digits writing system)
  • removes tonal marks in Thai script
  • we never make digits "glued" to the end of word a separate token
  • we don't tokenize Asian languages which don't use whitespaces properly

Here's what actually got changed 👏

  • Added tests

Here's how others can test the changes 👀

Run tests :)

Copy link
Member

@orangejulius orangejulius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, these will be helpful tests to have as we take a look at #498

@missinglink
Copy link
Member

Hi @SiarheiFedartsou I just rebased origin/master after merging your other PR and these cases now fail due to differences between the two tokenizers.

Would you mind updating it to reflect the behaviour you were expecting from the icuTokenizer 🙏

@missinglink
Copy link
Member

Are these failures expected?

✖ thai_tonemarks
-----------------
  operator: deepEqual
  expected: |-
    { '@pos0': [ 'กกกกขขขขคคคคฆฆฆฆ' ] }
  actual: |-
    {}
✖ chinese_address
------------------
  operator: deepEqual
  expected: |-
    { '@pos0': [ '北京市朝阳区东三环中路1号国际大厦a座1001室' ] }
  actual: |-
    {}

@SiarheiFedartsou
Copy link
Contributor Author

Are these failures expected?

✖ thai_tonemarks
-----------------
  operator: deepEqual
  expected: |-
    { '@pos0': [ 'กกกกขขขขคคคคฆฆฆฆ' ] }
  actual: |-
    {}
✖ chinese_address
------------------
  operator: deepEqual
  expected: |-
    { '@pos0': [ '北京市朝阳区东三环中路1号国际大厦a座1001室' ] }
  actual: |-
    {}

I doubt it :) I will look into it nearest days. Most likely we should have different expected results for ICU and no-ICU...

@SiarheiFedartsou
Copy link
Contributor Author

@missinglink done. PTAL again.

@missinglink missinglink merged commit 1319c20 into pelias:master Feb 7, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants