This repository has been archived by the owner on May 10, 2023. It is now read-only.

add korean sentence validation & cleanup #630

Merged
merged 1 commit into from
Aug 8, 2022

Conversation

sftblw
Contributor

@sftblw sftblw commented Jul 30, 2022

This adds Korean validation and cleanup.

  • The Korean script should only allow Korean letters and a few special characters [.,?!] (in my opinion).
    • There are too many things to disallow, so I added the rule as a whitelist rule.
    • For more responsive error messages I added some rules before it, covering what may be common mistakes.
  • For sentence length calculation, I replaced tokenization with character length.
    • In my opinion, tokenization is somewhat heavyweight for this task, especially for CJK (Chinese, Japanese, Korean).
    • The max length of 50 is somewhat arbitrary, but the famous pangram "키스의 고유조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다" is 35 characters long, so it should be sufficient.
  • About normalization
    • The Korean input system has (mainly) two kinds of characters (in Unicode space), which are interchangeable but have different Unicode code points.
    • In my opinion both should be allowed on submission, but normalized into one kind of codepoint for actual usage or when storing the sentence in the DB.
    • So I added normalization to both validation and cleanup.
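The whitelist approach described above can be sketched as follows. The `{ regex, error }` rule shape mirrors the fragment visible later in this review; the `validateSentence` wrapper and the exact error wording are illustrative, not the actual code from this PR:

```javascript
// Sketch of a whitelist-style validation rule for Korean sentences.
// The { regex, error } rule shape follows the pattern shown in the
// review diff; validateSentence itself is illustrative.
const INVALIDATIONS = [{
  // Whitelist: reject any character outside Hangul syllables,
  // the punctuation [.,?!], and spaces.
  regex: /[^가-힣.,?! ]/,
  error: 'Sentence should only contain Korean letters and . , ? !',
}];

function validateSentence(sentence) {
  // Normalize first so decomposed (NFD) input is not rejected.
  const normalized = sentence.normalize('NFC');
  for (const rule of INVALIDATIONS) {
    if (rule.regex.test(normalized)) {
      return rule.error;
    }
  }
  return null; // valid
}
```

Because the sentence is normalized to NFC before the whitelist check, decomposed Sebeolsik input passes the same rule as composed input.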

Member

@MichaelKohler MichaelKohler left a comment


Thanks for contributing these. I have added a few comments.

server/lib/cleanup/languages/ko.js (review thread resolved, outdated)
server/lib/validation/languages/ko.js (review thread resolved)
server/lib/validation/languages/ko.js (review thread resolved, outdated)
}, {
regex: /[a-zA-Z]/,
error: 'Sentence should not contain latin alphabet characters',
}, {
Member


I think the three rules here can be removed, as the last rule in this list would catch those anyway. What do you think?

server/lib/validation/languages/ko.js (review thread resolved, outdated)
server/lib/validation/languages/ko.js (review thread resolved, outdated)
server/lib/validation/languages/ko.js (review thread resolved, outdated)
@sftblw
Contributor Author

sftblw commented Aug 1, 2022

Thank you for the feedback.

About normalization

Normalization was intended for contributors who use a certain kind of keyboard layout, called "Sebeolsik".

Korean letters (Hangul) have two kinds of codepoints in Unicode.

  • Composed form (Unicode "Hangul Syllables": U+AC00–U+D7A3)
    • One Unicode codepoint contains two or three letters in a rectangular shape.
    • These codepoints are the ones normally used.
    • e.g. "col" is 콜, which is U+CF5C, taking only one codepoint.
  • Other forms
    • Other Unicode codepoints treat Korean letters as separate vowels and consonants.
    • These take double the space in bytes.
    • e.g. "col" is 콜 but is actually three codepoints: (1st k) (2nd oh) (3rd r).
    • This form only appears when a contributor is using the keyboard layout called "Sebeolsik", which is akin to Dvorak, or in macOS filenames, since macOS normalizes filenames with decomposition normalization (NFD).
  • The two kinds of codepoints can be converted into each other via normalization.
    • "콜".normalize('NFD').split('') is Array(3) [ "ᄏ", "ᅩ", "ᆯ" ].
      • and "콜".split('') is Array [ "콜" ], proving it is one and only one codepoint.
    • [ "ᄏ", "ᅩ", "ᆯ" ].join('').normalize('NFC').split('') is Array [ "콜" ]
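The codepoint behaviour described above can be checked directly in Node.js or a browser console:

```javascript
// Composed form: one codepoint (U+CF5C).
const composed = '콜';
console.log([...composed].length); // 1

// Decomposed form (NFD): three conjoining jamo codepoints
// (choseong ᄏ, jungseong ᅩ, jongseong ᆯ).
const decomposed = composed.normalize('NFD');
console.log([...decomposed].length); // 3

// NFC recomposes the jamo back into a single syllable codepoint.
console.log(decomposed.normalize('NFC') === composed); // true
```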

Everything becomes clean if the sentence is normalized with composition normalization (NFC) before it is submitted to the database, but I couldn't find an existing way to normalize (morph) the sentence before submission. The cleanup script was added to complement this in another way while supporting minor cases (and I don't think that is a clean way to do it, either).

So, it might be handled in one of two ways:

  1. Simply prevent contributors from inputting decomposed code points (only allowing "Hangul Syllables"), since this makes things clear and those kinds of codepoints are quite rarely used (in my opinion).
  2. Add normalization before submitting, changing the original sentence (since NFC is a non-destructive normalization, this is perfectly fine to do).

(Also, /[^가-힣.,?! ]/u only allows the composed form; that was my mistake.)

Character length

I checked one of the "Public for Korean" datasets from aihub.or.kr — validation set, 10 random sentences (without actually listening to them, trusting the metadata to be correct).

validation - broadcast set

  data no.   length   seconds
  14         121      15.87
  29          38       3.84
  43          37       4.48
  61          37       3.46
  458         97      11.01
  466         61       7.3
  475         73      10.37
  738         44       5.5
  826         45       5.12
  993         52       5.63

(605 characters / 72.58 seconds) = 8.33563 characters / second
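The averages above can be reproduced, and used to sanity-check the max length, with a short script (the 10-second recording limit referenced later in this thread is used here as an assumption for the headroom estimate):

```javascript
// (length, seconds) pairs from the aihub.or.kr validation/broadcast sample above.
const samples = [
  [121, 15.87], [38, 3.84], [37, 4.48], [37, 3.46], [97, 11.01],
  [61, 7.3], [73, 10.37], [44, 5.5], [45, 5.12], [52, 5.63],
];

const chars = samples.reduce((sum, [len]) => sum + len, 0);   // 605
const secs = samples.reduce((sum, [, sec]) => sum + sec, 0);  // 72.58
const rate = chars / secs;                                    // ≈ 8.34 chars/second

// At this rate a 10-second recording corresponds to roughly 83 characters,
// so a max length of 50 leaves comfortable headroom.
console.log(rate.toFixed(5));
```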

@MichaelKohler
Member

Thanks for the explanation. I'm now checking if we could instead add NFC somewhere else (even before validation) to make this simpler. I will report back.

@HarikalarKutusu
Contributor

HarikalarKutusu commented Aug 3, 2022

max length 50 is somewhat random, but the famous pangram "키스의 고유조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다" is 35 characters long, so it should be sufficient.

8.33563 characters / second

Wouldn't it be better to increase the max sentence length? Something like 70?
Common Voice recordings are limited to 10 seconds, mainly optimized for batch processing on 8 GB GPUs.
If you limit your max length further, you may be forced to exclude many sentences, which are hard to come by.
I'm asking without knowing the language, of course...

@MichaelKohler
Member

I have checked with others and per-language NFC normalization is something we want to support. Therefore I implemented that capability here: 5a86a81 . I've already enabled it for Korean, however it's not on the live website yet. I will create a new release once we have the validation here as well.

This means:

  • You do not have to normalize before counting the characters, the sentence passed into the function will already be normalized
  • The regex rules will be applied on the normalized sentence
  • We do not need the cleanup changes

It would be great if you could rebase your branch on the latest main branch, test out my addition, and check whether it works correctly with your validation as well.
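A per-language NFC normalization step of the kind described above could look like the following. This is only a sketch; the actual names and structure in commit 5a86a81 may differ, and `NORMALIZE_NFC` and `normalizeSentence` are illustrative:

```javascript
// Sketch: apply NFC normalization for languages that opt in,
// before validation rules run. Identifiers here are illustrative,
// not the actual ones from commit 5a86a81.
const NORMALIZE_NFC = new Set(['ko']);

function normalizeSentence(languageCode, sentence) {
  if (NORMALIZE_NFC.has(languageCode)) {
    // NFC is non-destructive: composed input is unchanged,
    // decomposed (Sebeolsik/macOS NFD) input is recomposed.
    return sentence.normalize('NFC');
  }
  return sentence;
}
```

With such a hook in place, the character count and regex rules downstream always see the composed form, which is why the cleanup changes become unnecessary.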

Happy to provide any more context if needed and thanks for the suggestion!

Wouldn't it be better to increase the max sentence length? Something like 70?
Common Voice recordings are limited to 10 seconds, mainly optimized for batch processing on 8 GB GPUs.
If you limit your max length further, you may be forced to exclude many sentences, which are hard to come by.
I'm asking without knowing the language, of course...

Not by too much though. 70 would mean roughly 8.5 seconds on average. If somebody reads/speaks 20% slower than average, then it would already hit the 10 seconds hard limit.

Member

@MichaelKohler MichaelKohler left a comment


Marking this as "request changes", see my previous comment.

@sftblw sftblw closed this Aug 8, 2022
@sftblw
Contributor Author

sftblw commented Aug 8, 2022

Sorry, closing this PR was a mistake; I'm not familiar with forked GitHub repositories and pressed the "sync" with discard button. I'll reopen after some modification.

@sftblw sftblw reopened this Aug 8, 2022
@sftblw
Contributor Author

sftblw commented Aug 8, 2022

  • I made a new commit (because of the mistake)
  • Changed the error messages to Korean (if I understood the recommendation correctly)
  • Removed some validations but left/modified some. The last validation rule would cover the two rules above it, but I left them in to guide contributors.
    • ㅋㅋㅋ means lol (a laughing sound); those characters are in composed form but consist only of the 1st letter of the three (or two).
    • Decomposed (non-composed) codepoints will not be matched by [ㄱ-ㅎㅏ-ㅣ], since that range covers the compatibility jamo form, so the second rule is still needed (to guide contributors).
    • If we don't need that guidance, those two validation rules are not required.
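The distinction between the two jamo blocks described above can be verified directly; this snippet only demonstrates the Unicode ranges involved, not the actual rules in the PR:

```javascript
// 'ㅋㅋㅋ' is written with Hangul Compatibility Jamo (U+3131–U+3163),
// which the range [ㄱ-ㅎㅏ-ㅣ] covers.
const compatJamo = /[ㄱ-ㅎㅏ-ㅣ]/;
console.log(compatJamo.test('ㅋㅋㅋ')); // true

// Decomposed (NFD) syllables use the conjoining jamo block
// (U+1100–U+11FF), which that range does NOT cover, so a separate
// rule is needed to catch decomposed input.
console.log(compatJamo.test('콜'.normalize('NFD'))); // false
```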

Member

@MichaelKohler MichaelKohler left a comment


Thanks!

@MichaelKohler MichaelKohler merged commit 70183de into common-voice:main Aug 8, 2022
@MichaelKohler
Member

🎉 This PR is included in version 2.18.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
