Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify diacratic removal; modify Latin & Greek preprocessors #724

Merged
merged 13 commits into from
Apr 8, 2024
Merged

Simplify diacratic removal; modify Latin & Greek preprocessors #724

merged 13 commits into from
Apr 8, 2024

Conversation

martholomew
Copy link

I added a very simple function to remove diacritics that should also work with the other languages (besides just LA and GRC), but I do not know enough about them to be able to judge that.

Thanks!

@martholomew martholomew requested a review from a team as a code owner February 23, 2024 20:49
@StefanVukovic99
Copy link
Member

This looks good, just want to give some context on removing diacritics.

In some languages, diacritics may be optional, or used only in texts for children and learners, dictionaries, where disambiguation is necessary, etc. Lookup needs work on texts with or without them. Instead of trying combinations of diacritics on text without them, it is more efficient to have the diacritic-less variations in the dictionary, and remove them with a textPreprocessor if the text does have them. This may however introduce ambiguity, so maybe the term-reading system could be used, with readings containing the diacritics.

In some cases (e.g. German umlauts) diacritics are not generally omitted, so they can be in the dictionary, and there is no need to remove them.

In the case of Latin, I was under the impression that the long vowel marks and such were optional, but used in some contexts, and so should be removed from the dictionary and in text preprocessing. Unlike me, you seem to actually know something about Latin, so I hope you can verify these assumptions. I'll also probably update the latin diacritic removal at KTY to match what you do here.

@martholomew
Copy link
Author

I must prefix this by saying that I am a learner of Latin myself, but, as you say, the usage of macrons is not consistent in anything aside from dictionaries and learner texts.

Copy link

codspeed-hq bot commented Feb 24, 2024

CodSpeed Performance Report

Merging #724 will not alter performance

Comparing martholomew:master (18fe417) with master (e47a0f4)

Summary

✅ 4 untouched benchmarks

Copy link

github-actions bot commented Feb 24, 2024

⚠️ Visual differences introduced by this PR; please validate if they are desirable.

View Playwright Report (note: open the "playwright-report" artifact)

@toasted-nutbread
Copy link

Relevant, but I've also ran into problems using string.normalize before since it is somewhat indiscriminate with regard to what it replaces. Can create problems with dictionary forms of words not being searchable.

_normalizeTermOrReading(text) {
// Note: this function should not perform String.normalize on the text,
// as it will normalize characters in an undesirable way.
// Thus, this function is currently a no-op.
// Example:
// - '\u9038'.normalize('NFC') => '\u9038' (逸)
// - '\ufa67'.normalize('NFC') => '\u9038' (逸 => 逸)
return text;
}

Realistically, there should be some guidelines for dictionary creators for how the data should be formatted or normalized.

@martholomew
Copy link
Author

Good information for not using it too willy-nilly. I assume that it should perform well on Lain-based languages (and I did not run into any problems myself), and issues related to it would crop up quickly in languages with so few characters.

@djahandarie
Copy link
Collaborator

@martholomew Could you leave a similar warning on this function as a comment? Otherwise I feel it's quite likely someone might try to use it for a language with a more complicated writing system and introduce a bug by accident.

@Casheeew Casheeew added the kind/enhancement The issue or PR is a new feature or request label Feb 27, 2024
@djahandarie djahandarie changed the title Simplified diacratic removal and added preprocessors to LA and GRC Simplify diacratic removal; modify Latin & Greek preprocessors Feb 28, 2024
Signed-off-by: Matttttt <18152455+martholomew@users.noreply.github.com>
Casheeew
Casheeew previously approved these changes Mar 2, 2024
Signed-off-by: Darius Jahandarie <djahandarie@gmail.com>
themoeway-bot
themoeway-bot previously approved these changes Mar 2, 2024
Signed-off-by: Darius Jahandarie <djahandarie@gmail.com>
themoeway-bot
themoeway-bot previously approved these changes Mar 2, 2024
@toasted-nutbread
Copy link

toasted-nutbread commented Mar 3, 2024

CI unit test failure:

"ext/js/language/la/latin-text-preprocessors.js",
needs to be added to removed from the eslintrc override with the "serviceworker": true environment.

@Casheeew Casheeew added the area/linguistics The issue or PR is related to linguistics label Mar 4, 2024
Signed-off-by: Matttttt <18152455+martholomew@users.noreply.github.com>
@djahandarie
Copy link
Collaborator

@martholomew Are you able to figure out this TS error in CI or should I try to find someone who can help?

@StefanVukovic99
Copy link
Member

Maybe just needs language-descriptors.d.ts to be updated

@martholomew
Copy link
Author

@martholomew Are you able to figure out this TS error in CI or should I try to find someone who can help?

Sorry for the late response, I don't know how to fix the error, if I could get help that would be great!

@StefanVukovic99
Copy link
Member

made a PR https://github.com/martholomew/yomitan/pull/1

@jamesmaa jamesmaa enabled auto-merge April 8, 2024 18:29
@jamesmaa jamesmaa disabled auto-merge April 8, 2024 18:47
@jamesmaa jamesmaa added this pull request to the merge queue Apr 8, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 8, 2024
@jamesmaa jamesmaa added this pull request to the merge queue Apr 8, 2024
Merged via the queue into themoeway:master with commit 0663774 Apr 8, 2024
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/linguistics The issue or PR is related to linguistics kind/enhancement The issue or PR is a new feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants