Simplify diacratic removal; modify Latin & Greek preprocessors #724

martholomew · 2024-02-23T20:49:35Z

I added a very simple function to remove diacritics that should also work with the other languages (besides just LA and GRC), but I do not know enough about them to be able to judge that.

Thanks!

StefanVukovic99 · 2024-02-23T21:22:05Z

This looks good, just want to give some context on removing diacritics.

In some languages, diacritics may be optional, or used only in certain contexts: texts for children and learners, dictionaries, places where disambiguation is necessary, etc. Lookup needs to work on texts with or without them. Instead of trying combinations of diacritics on text that lacks them, it is more efficient to store the diacritic-less variants in the dictionary and strip diacritics with a textPreprocessor when the text does have them. This may, however, introduce ambiguity, so perhaps the term-reading system could be used, with readings containing the diacritics.

In some cases (e.g. German umlauts) diacritics are not generally omitted, so they can be in the dictionary, and there is no need to remove them.

In the case of Latin, I was under the impression that the long vowel marks and such were optional but used in some contexts, and so should be removed both from the dictionary and in text preprocessing. Unlike me, you seem to actually know something about Latin, so I hope you can verify these assumptions. I'll also probably update the Latin diacritic removal at KTY to match what you do here.
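The preprocessing step described above can be sketched roughly as follows. This is a minimal illustration and not necessarily the code in this PR: the name `stripCombiningMarks` is hypothetical, and the sketch assumes that decomposing with NFD and dropping the Combining Diacritical Marks block (U+0300 to U+036F) is an acceptable approach for Latin and Greek.

```javascript
// Hypothetical helper for illustration only; the name is an assumption.
function stripCombiningMarks(text) {
    // NFD splits precomposed letters into a base letter plus combining marks,
    // e.g. 'ī' (U+012B) becomes 'i' + U+0304 (combining macron).
    // The regex then drops everything in the Combining Diacritical Marks block.
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripCombiningMarks('am\u012bcus')); // amīcus -> amicus
console.log(stripCombiningMarks('\u03bb\u03cc\u03b3\u03bf\u03c2')); // λόγος -> λογος
// Umlauts are stripped too, which is why this should not be applied to German:
console.log(stripCombiningMarks('\u00fcber')); // über -> uber
```

With a dictionary that stores `amicus`, both `amicus` and `amīcus` in the source text would then resolve to the same entry.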

martholomew · 2024-02-23T21:36:42Z

I must preface this by saying that I am a learner of Latin myself, but, as you say, macrons are not used consistently outside of dictionaries and learner texts.

codspeed-hq · 2024-02-24T00:25:44Z

CodSpeed Performance Report

Merging #724 will not alter performance

Comparing martholomew:master (18fe417) with master (e47a0f4)

Summary

✅ 4 untouched benchmarks

github-actions · 2024-02-24T00:26:08Z

⚠️ Visual differences introduced by this PR; please validate if they are desirable.

View Playwright Report (note: open the "playwright-report" artifact)

toasted-nutbread · 2024-02-24T03:48:24Z

Relevant, but I've also run into problems using String.normalize before, since it is somewhat indiscriminate with regard to what it replaces. This can create problems with dictionary forms of words not being searchable.

yomitan/ext/js/dictionary/dictionary-importer.js

Lines 784 to 792 in 9b5de0d

    
    _normalizeTermOrReading(text) {
        // Note: this function should not perform String.normalize on the text,
        // as it will normalize characters in an undesirable way.
        // Thus, this function is currently a no-op.
        // Example:
        // - '\u9038'.normalize('NFC') => '\u9038' (逸)
        // - '\ufa67'.normalize('NFC') => '\u9038' (逸 => 逸)
        return text;
    }

Realistically, there should be some guidelines for dictionary creators for how the data should be formatted or normalized.
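The behaviour noted in that code comment can be reproduced with a short check. This is only a sketch; the variable names are illustrative. It relies on U+FA67 being a CJK compatibility ideograph with a canonical mapping to U+9038, so both NFC and NFD replace it.

```javascript
const compat = '\ufa67'; // CJK COMPATIBILITY IDEOGRAPH-FA67
const unified = '\u9038'; // 逸

// The two code points are distinct strings before normalization...
console.log(compat === unified); // false

// ...but normalization silently rewrites the compatibility form.
console.log(compat.normalize('NFC') === unified); // true
console.log(compat.normalize('NFD') === unified); // true
```

A dictionary headword stored as U+FA67 would therefore no longer match itself once either side of the lookup has been normalized.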

martholomew · 2024-02-24T10:48:00Z

Good information for not using it too willy-nilly. I assume that it should perform well on Latin-based languages (and I did not run into any problems myself), and issues related to it would crop up quickly in languages with so few characters.

djahandarie · 2024-02-25T04:28:59Z

@martholomew Could you leave a similar warning on this function as a comment? Otherwise I feel it's quite likely someone might try to use it for a language with a more complicated writing system and introduce a bug by accident.
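To illustrate the kind of accidental bug such a warning is meant to prevent, here is a sketch of a documented helper applied to scripts it was not designed for. The function name and the wording of the comment are assumptions made for this example, not the text that was added in the PR.

```javascript
/**
 * WARNING (illustrative wording): only suitable for alphabetic scripts such as
 * Latin or Greek. String.prototype.normalize also rewrites characters in other
 * scripts, so do not reuse this for languages with more complex writing systems.
 * Hypothetical function name, for illustration only.
 */
function removeMarksNaively(text) {
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

// Japanese: が (U+304C) decomposes to か (U+304B) + U+3099, which lies outside
// the stripped range, so the result is a different, decomposed string.
const kana = '\u304c';
console.log(removeMarksNaively(kana) === kana); // false
console.log(removeMarksNaively(kana).length); // 2

// Korean: the syllable 한 (U+D55C) decomposes into three conjoining jamo.
const hangul = '\ud55c';
console.log(removeMarksNaively(hangul).length); // 3
```

In both cases no diacritic is removed, yet the text no longer matches a dictionary entry stored in its composed form.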

Signed-off-by: Darius Jahandarie <djahandarie@gmail.com>

ext/js/language/text-preprocessors.js

Signed-off-by: Matttttt <18152455+martholomew@users.noreply.github.com>

Signed-off-by: Darius Jahandarie <djahandarie@gmail.com>

toasted-nutbread · 2024-03-03T15:48:25Z

CI unit test failure:

"ext/js/language/la/latin-text-preprocessors.js",
needs to be ~~added to~~ removed from the eslintrc override with the "serviceworker": true environment.

Signed-off-by: Matttttt <18152455+martholomew@users.noreply.github.com>

djahandarie · 2024-03-18T11:20:38Z

@martholomew Are you able to figure out this TS error in CI or should I try to find someone who can help?

StefanVukovic99 · 2024-03-18T18:52:36Z

Maybe just needs language-descriptors.d.ts to be updated

martholomew · 2024-03-26T20:45:35Z

> @martholomew Are you able to figure out this TS error in CI or should I try to find someone who can help?

Sorry for the late response. I don't know how to fix the error; if I could get help, that would be great!

StefanVukovic99 · 2024-03-26T21:30:31Z

made a PR https://github.com/martholomew/yomitan/pull/1

fix tests, merge master

Simplified diacratic removal and added preprocessors to LA and GRC

4834f39

martholomew requested a review from a team as a code owner February 23, 2024 20:49

linted

5ea1f2e

martholomew and others added 2 commits February 25, 2024 14:40

Clarified the name of removeAlphabeticDiacritics

b8d0b31

Add comment to removeAlphabeticDiacritics

983b9e9

Signed-off-by: Darius Jahandarie <djahandarie@gmail.com>

djahandarie requested a review from toasted-nutbread February 27, 2024 12:23

Casheeew added the kind/enhancement label (The issue or PR is a new feature or request) Feb 27, 2024

djahandarie changed the title ~~Simplified diacratic removal and added preprocessors to LA and GRC~~ Simplify diacratic removal; modify Latin & Greek preprocessors Feb 28, 2024

Casheeew reviewed Feb 28, 2024

ext/js/language/text-preprocessors.js (outdated)

Change to NFD

f43c648

Signed-off-by: Matttttt <18152455+martholomew@users.noreply.github.com>

Casheeew previously approved these changes Mar 2, 2024

Merge branch 'master' into master

94a4d11

Signed-off-by: Darius Jahandarie <djahandarie@gmail.com>

themoeway-bot previously approved these changes Mar 2, 2024

Remove trailing spaces in comment

18fe417

Signed-off-by: Darius Jahandarie <djahandarie@gmail.com>

djahandarie dismissed themoeway-bot’s stale review via 18fe417 March 2, 2024 11:24

themoeway-bot previously approved these changes Mar 2, 2024

toasted-nutbread previously approved these changes Mar 2, 2024

Merge branch 'master' into master

be1132e

martholomew dismissed Casheeew’s stale review via be1132e March 3, 2024 05:29

Casheeew added the area/linguistics label (The issue or PR is related to linguistics) Mar 4, 2024

Remove latin preprocessors .eslintrc.json

e0de851

Signed-off-by: Matttttt <18152455+martholomew@users.noreply.github.com>

martholomew dismissed themoeway-bot’s stale review via e0de851 March 5, 2024 21:46

martholomew dismissed toasted-nutbread’s stale review via e0de851 March 5, 2024 21:46

Merge branch 'master' into master

d96d0fb

StefanVukovic99 added 2 commits March 26, 2024 22:26

fix tests

4b1eed9

Merge branch 'master' into martholow-latin

368c67e

Merge pull request #1 from StefanVukovic99/martholow-latin

9d1712a

fix tests, merge master

jamesmaa enabled auto-merge April 8, 2024 18:29

jamesmaa approved these changes Apr 8, 2024

jamesmaa disabled auto-merge April 8, 2024 18:47

jamesmaa added this pull request to the merge queue Apr 8, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 8, 2024

jamesmaa added this pull request to the merge queue Apr 8, 2024

Merged via the queue into themoeway:master with commit 0663774 Apr 8, 2024
10 checks passed
