Feat/add language detection #1023
Conversation
Codecov Report

    @@           Coverage Diff           @@
    ##             main    #1023   +/-   ##
    =======================================
      Coverage   94.91%   94.91%
    =======================================
      Files         135      135
      Lines        5570     5590   +20
    =======================================
    + Hits         5287     5306   +19
    - Misses        283      284    +1

Flags with carried forward coverage won't be shown.
Hi @aminemindee 👋, could you maybe add a short latency benchmark with language detection and without, as a comparison? I agree that it is a great feature (which should be inside doctr), but I see fasttext as a bit troublesome. Have you tested:
About docker:
Hello @felixdittrich92, when you say a benchmark with and without language detection, what shape do you have in mind? For the Docker issue, you're right: apparently fasttext is the one making it fail, because it needs C++11.
Hi @aminemindee, yeah, I'm fine if you test it locally on your side and share the time it took with and without language detection, maybe for 2-3 test documents :) Sounds good to me (from skimming the lib, it only uses n-grams, so it should also be much faster than fasttext) 👍
Hello @felixdittrich92, I did a benchmark with fasttext, langdetect and lingua-py on 3 PDFs with a total of 24 pages.
It's weird that it is sometimes faster when using a language detection model (probably the overhead is insignificant compared to the rest). lingua-py becomes slower the more languages you tell it to handle (in this benchmark I used all available languages). Also something to take into consideration: in their documentation there is this:
Hi @aminemindee, I think if we can choose, it is enough to support language detection for the languages where we also support vocabs (see datasets/vocabs). Was lingua-py accurate?
Hello @felixdittrich92, if I understand correctly, you suggest to only include the languages that are present in the vocabs. Regarding the performance of lingua-py: for the documents I tested, it got the right answer with probabilities of 1.0.
@aminemindee yes, that's what I mean :) prob sounds also good 😅 (we could also grab the vocabs via an enum, so that language detection is updated if we add a new vocab 👍)
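To illustrate the enum idea, a minimal sketch (not code from this PR; the VOCABS import path and the name matching against lingua's Language enum are assumptions):

```python
# Hypothetical sketch: derive the detectable languages from the vocabs doctr
# already ships, so adding a new vocab automatically extends language detection.
from doctr.datasets.vocabs import VOCABS
from lingua import Language, LanguageDetectorBuilder


def build_detector_from_vocabs():
    # Keep only vocab names that match a lingua Language member, e.g. "french" -> Language.FRENCH
    languages = [
        Language[name.upper()]
        for name in VOCABS
        if name.upper() in Language.__members__
    ]
    return LanguageDetectorBuilder.from_languages(*languages).build()
```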
Yep, really good idea. I tried to do that; tell me if it's possible to do it in a cleaner way.
Hi @aminemindee 👋,
thanks for updating this 👍
I have added some minor comments :)
Would be great if you could also add the benchmark comparing lingua with all languages vs. only those which are available in vocabs 💯
For the benchmark:
Mh... I thought we could speed it up a bit more, but it should be ok for now :) Short question: does your benchmark include OCR times? 😅
doctr/utils/lang_detect.py (outdated)

        The detected language in ISO 639 code and confidence score
    """
    model = LanguageDetectorBuilder.from_languages(*languages).build()
@aminemindee I think we should init the model outside of the function, otherwise it will be initialized again on every call.
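A minimal sketch of that suggestion, assuming the lingua-based helper; the language set and function name are illustrative, not the PR's actual code:

```python
# Hypothetical sketch: build the (expensive) detector once at import time and
# reuse it on every call, instead of rebuilding it inside the function.
from lingua import Language, LanguageDetectorBuilder

# Built once when the module is imported, then shared by all calls.
_DETECTOR = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
).build()


def detect_language(text: str) -> str:
    """Return the detected ISO 639-1 code for ``text``, or "unknown"."""
    language = _DETECTOR.detect_language_of(text)
    return language.iso_code_639_1.name.lower() if language else "unknown"
```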
I think this also takes some time, so maybe an overall speed-up? There would be the question of how much RAM it needs!?
I also saw that there is a with_low_accuracy_mode method on the DetectorBuilder; could we test whether it is maybe enough for our use case?
We need to be careful with latency drops and high memory consumption (yes, I know it is optional currently, but I think it makes sense to check this) :)
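For reference, a short sketch of how the low-accuracy mode is switched on in lingua-py (it trades accuracy on short texts for smaller in-memory models; the language set here is only an example):

```python
# Sketch: enable lingua-py's low-accuracy mode to reduce memory usage.
from lingua import Language, LanguageDetectorBuilder

detector = (
    LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.FRENCH)
    .with_low_accuracy_mode()
    .build()
)
print(detector.detect_language_of("this is a sentence."))
```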
Done and pushed.
we gain 0.3 s 👍
model | timeit value | start - end time
---|---|---
None | 12.6 s ± 25.1 ms | 12.55 s
fasttext | 12.5 s ± 149 ms | 12.79 s
langdetect | 12.3 s ± 96.5 ms | 12.57 s
lingua-py | 13.1 s ± 100 ms | 13.28 s
lingua-py (selected languages) | 12 s ± 86.1 ms | 12.22 s
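For context, the with/without comparison could be reproduced with something along these lines (a sketch only: the file names are placeholders and langdetect stands in for whichever library is under test, not the exact script used for the numbers above):

```python
# Rough benchmark sketch for timing OCR with and without language detection.
import timeit

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from langdetect import detect  # swap in fasttext / lingua-py to compare libraries

predictor = ocr_predictor(pretrained=True)
docs = [DocumentFile.from_pdf(path) for path in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]]


def run(with_lang_detect: bool) -> None:
    for pages in docs:
        result = predictor(pages)
        if with_lang_detect:
            detect(result.render())  # language detection on the extracted text


print("without:", timeit.timeit(lambda: run(False), number=3))
print("with   :", timeit.timeit(lambda: run(True), number=3))
```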
Can you tell me where the with_low_accuracy_mode is? I don't seem to find it.
For the RAM: using only the nine languages in vocabs, it uses 0.7 GB.
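As a side note, such a RAM figure can be sanity-checked with a quick resident-memory probe like the following (a sketch; psutil is an extra dependency and the language set is only an example):

```python
# Sketch: measure process RSS before and after building and warming up the detector.
import os

import psutil
from lingua import Language, LanguageDetectorBuilder

process = psutil.Process(os.getpid())
before = process.memory_info().rss

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN
).build()
detector.detect_language_of("warm up")  # models are loaded lazily on first use

after = process.memory_info().rss
print(f"detector memory: {(after - before) / 1024 ** 3:.2f} GB")
```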
Puh... then we should test whether with_low_accuracy_mode is enough.
With with_low_accuracy_mode, the sentence 'this is a sentence.' gets predicted as Portuguese x) The RAM goes down to 0.3 GB.
langdetect also uses 0.3 GB of RAM, and it uses an AI-based approach, meaning the confidence score is more meaningful, since it is a probability and not a distance.
Mh, I would say that 0.8 GB is too much... So given your tests, do you think it is ok to switch to langdetect? Let's try this, and then let me know your decision; I will go with it :)
From what I tested:
- fasttext performs best, but you have to download a model and it's the largest one.
- langdetect is the easiest one to use, but it's not really well done (you have to wrap it in a try/except because it raises an error if the text doesn't contain any character it knows); see the sketch below.
- lingua seems really good, but it uses too much RAM, and its performance is not that good with the low-accuracy mode.

So for now I think langdetect is the reasonable choice, knowing we can change it easily afterwards.
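A minimal sketch of that try/except pattern around langdetect (the helper name is illustrative, not necessarily what the PR ends up with):

```python
# Sketch: wrap langdetect defensively, since it raises on text it cannot handle.
from typing import Tuple

from langdetect import DetectorFactory, detect_langs
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs


def get_language(text: str) -> Tuple[str, float]:
    """Return (ISO 639-1 code, probability), or ("unknown", 0.0) on failure."""
    try:
        candidates = detect_langs(text)  # sorted by decreasing probability
    except LangDetectException:
        # Raised e.g. when the text contains no characters langdetect recognises
        return "unknown", 0.0
    best = candidates[0]
    return best.lang, best.prob
```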
@aminemindee thanks for all the updates 🤗
Hello, this PR deals with the problem discussed here: #973, regarding the language detection part. (It may need a rebase if the one about orientation is merged first.)