-
Notifications
You must be signed in to change notification settings - Fork 465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adding an Arabic vocab file #514
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, thanks for the PR!
Could you please move this to doctr/datasets/vocabs.py
?
And as you mentioned, your are aggregating Hindi + Arabic + Farsi characters if I understood well, would it be possible to split those entries in the dictionary right there (below ancient greek):
VOCABS: Dict[str, str] = {
'digits': string.digits,
'ascii_letters': string.ascii_letters,
'punctuation': string.punctuation,
'currency': '£€¥¢฿',
'ancient_greek': 'αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ',
}
For instance: 'hindi_letters', 'arabic_diacritics', etc... And try to be as exhaustive as possibe for each sub class. Then you can add below the dictionnary your full vocab (you can follow the example of european vocabs):
VOCABS['arabic'] = VOCABS['hindi_letters'] + VOCABS['farsi'] + VOCABS['arabic_diacritics'] + ...
Thanks for that ! 🙏
Thanks @charlesmindee for your help and guidance. I appreciate it. I am new to PR pushing, so I hope I am doing it right. Please find the changes I made here Instead of creating a new item "arabic_numbers", I am using "VOCABS['digits']". Please let me know if there is anything else that I need to do at my end. Thanks, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the edits! Just a small modification and it looks like we're good to merge!
3faad13
to
986732d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the edit, you only need to split the last line in 2 lines to pass the flake8 test (code style enforcement, each line must be < 120) and we are good to merge (all the other tests are OK)!
closes #490 |
Codecov Report
@@ Coverage Diff @@
## main #514 +/- ##
=======================================
Coverage 95.38% 95.38%
=======================================
Files 109 109
Lines 4183 4184 +1
=======================================
+ Hits 3990 3991 +1
Misses 193 193
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again, I added some suggestions to fix flake8 and improve naming 👌
986732d
to
aaae04c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for edits, looks good to me!
I have added Arabic alphabet plus some Farsi alphabet because they may exist in many Arabic texts. I also added "Hindi" numbers and Arabic diacritics. Arabic uses connected shaping when it comes to characters, so characters are not isolated as in Latin-based languages. Hope this won't be an issue. Please let me know if you have any questions.