
Feat/add language detection #1023

Merged — 10 commits merged into main, Aug 23, 2022

Conversation

@aminemindee (Contributor) commented Aug 17, 2022

Hello, this PR addresses the language detection part of the problem discussed in #973. (It may need a rebase if the orientation PR is merged first.)

  • Add a language detection model
  • Add it to the predictor and output the result
  • Add tests

@aminemindee aminemindee mentioned this pull request Aug 17, 2022
@codecov codecov bot commented Aug 17, 2022

Codecov Report

Merging #1023 (c33bf8c) into main (2ea3b54) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main    #1023   +/-   ##
=======================================
  Coverage   94.91%   94.91%           
=======================================
  Files         135      135           
  Lines        5570     5590   +20     
=======================================
+ Hits         5287     5306   +19     
- Misses        283      284    +1     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 94.91% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| doctr/models/zoo.py | 100.00% <ø> (ø) |
| doctr/models/_utils.py | 98.30% <100.00%> (+0.30%) ⬆️ |
| doctr/models/builder.py | 99.03% <100.00%> (+<0.01%) ⬆️ |
| doctr/models/predictor/pytorch.py | 100.00% <100.00%> (ø) |
| doctr/models/predictor/tensorflow.py | 100.00% <100.00%> (ø) |
| doctr/transforms/functional/base.py | 95.65% <0.00%> (-1.45%) ⬇️ |


@frgfm frgfm added module: models Related to doctr.models type: new feature New feature labels Aug 17, 2022
@felixdittrich92 felixdittrich92 added this to the 0.6.0 milestone Aug 18, 2022
@felixdittrich92 (Contributor) commented

Hi @aminemindee 👋,
thanks for the PR :)

Could you maybe add a short latency benchmark comparing inference with and without language detection?

I agree that it is a great feature (which should live inside doctr), but I find fasttext a bit troublesome. Have you tested https://github.com/pemistahl/lingua-py? It looks much better to me (it requires only numpy and regex) :)

About the Docker build: RuntimeError: Unsupported compiler -- at least C++11 support is needed!

@aminemindee (Contributor, Author) commented

Hello @felixdittrich92
Thank you for your answers and feedback.

When you say a benchmark with and without language detection, what shape do you have in mind? A dummy test that logs the elapsed time and, e.g., expects the run with language detection to be longer? Or some other format?

For the Docker issue, you're right: apparently fasttext is what makes it fail, because it needs C++11.

The lingua-py lib you suggested seems really good. I'll try it and replace fasttext with it 👍

@felixdittrich92 (Contributor) commented Aug 18, 2022


Hi @aminemindee, yeah, I'm fine with you testing it locally on your side and sharing the time with and without language detection, maybe for 2-3 test documents :)

Sounds good to me (skimming the lib, it uses only n-grams, so it should also be much faster than fasttext) 👍
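For intuition, the n-gram idea behind lingua can be sketched in a few lines of plain Python. This is a toy illustration only (hand-made trigram profiles, simple overlap scoring), not lingua's actual implementation:

```python
from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy, hand-made profiles -- a real detector trains these on large corpora.
PROFILES = {
    "en": ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "fr": ngrams("le renard brun rapide saute par dessus le chien paresseux"),
}

def detect(text: str) -> str:
    """Return the language whose profile shares the most n-grams with text."""
    counts = ngrams(text)
    def overlap(profile: Counter) -> int:
        return sum(min(c, profile[g]) for g, c in counts.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(detect("the dog and the fox"))  # -> en (with these toy profiles)
```

Because scoring is just counter lookups, this kind of approach is fast at inference time; the cost lies in holding the per-language profile tables in memory, which is exactly the trade-off discussed below.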

@felixdittrich92 felixdittrich92 self-assigned this Aug 19, 2022
@aminemindee (Contributor, Author) commented Aug 19, 2022


Hello @felixdittrich92, I ran a benchmark using fasttext, langdetect and lingua-py on 3 PDFs with a total of 24 pages.

| model | timeit value | start - end time |
| --- | --- | --- |
| None | 12.6 s ± 25.1 ms | 12.55 s |
| fasttext | 12.5 s ± 149 ms | 12.79 s |
| langdetect | 12.3 s ± 96.5 ms | 12.57 s |
| lingua-py | 13.1 s ± 100 ms | 13.28 s |
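A benchmark of this shape can be reproduced with a small stdlib harness. `run_ocr` below is a hypothetical placeholder for the actual doctr predictor call on the test PDFs:

```python
import statistics
import time

def run_ocr(detect_language: bool = False) -> None:
    """Placeholder for the real doctr predictor call on the test PDFs."""
    time.sleep(0.01)  # stand-in for actual OCR work

def benchmark(fn, repeats: int = 5):
    """Return (mean, stdev) of wall-clock seconds over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples), statistics.stdev(samples)

mean_s, stdev_s = benchmark(lambda: run_ocr(detect_language=True))
print(f"{mean_s:.3f} s ± {stdev_s * 1000:.1f} ms")
```

With run-to-run noise of ±100-150 ms on a ~12 s pipeline, differences of a few hundred milliseconds between detectors are close to the noise floor, which matches the observation below.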

It's odd that the pipeline is sometimes faster with a language detection model enabled (the detection cost is probably insignificant compared to the rest of the pipeline).

lingua-py becomes slower the more languages you tell it to handle (in this benchmark I used all available languages). Also worth taking into consideration, their documentation says:

    Including all languages available in the library consumes approximately 3GB of memory and might lead to slow runtime performance.

@felixdittrich92 (Contributor) commented Aug 19, 2022

Hi @aminemindee, I think if we can choose, it is enough to support language detection for the languages where we also support vocabs (see datasets/vocabs).

Was lingua-py accurate?

@aminemindee (Contributor, Author) commented


Hello @felixdittrich92, if I understand correctly, you suggest only including the languages present in doctr/datasets/vocabs.py (English, French, Portuguese, Spanish, German, Arabic, Czech and Vietnamese), right? I'll run a benchmark with only these languages loaded.

Regarding accuracy: for the documents I tested, lingua-py got the right answer with probabilities of 1.0.

@felixdittrich92 (Contributor) commented


@aminemindee yes, that's what I mean :) The probabilities sound good too 😅 (we could also grab the vocabs via an enum, so the language detection updates automatically if we add a new vocab 👍)

@aminemindee aminemindee force-pushed the feat/add_language_detection branch from 36f0627 to 45c4b5d on August 22, 2022 15:40
@aminemindee (Contributor, Author) commented


Yep, really good idea. I tried to implement it; tell me if it could be done more cleanly.
I also added the detect_orientation and detect_language params to the ocr_predictor function, since it's the most used entry point.

@felixdittrich92 felixdittrich92 self-requested a review August 23, 2022 06:38
@felixdittrich92 felixdittrich92 (Contributor) left a review

Hi @aminemindee 👋,

thanks for updating this 👍
I have added some minor comments :)

Would be great if you could also add a benchmark comparing lingua with all languages vs. only those available in vocabs 💯

Review threads were opened on:
  • doctr/utils/lang_detect.py (4 comments, outdated, resolved)
  • doctr/models/zoo.py (resolved)
  • doctr/models/predictor/tensorflow.py (resolved)
@aminemindee (Contributor, Author) commented


For the benchmark:

| model | timeit value | start - end time |
| --- | --- | --- |
| None | 12.6 s ± 25.1 ms | 12.55 s |
| fasttext | 12.5 s ± 149 ms | 12.79 s |
| langdetect | 12.3 s ± 96.5 ms | 12.57 s |
| lingua-py | 13.1 s ± 100 ms | 13.28 s |
| lingua-py (selected languages) | 12.3 s ± 125 ms | 12.56 s |

@felixdittrich92 (Contributor) commented Aug 23, 2022


Mh... I thought we could speed it up a bit more, but it should be OK for now. :)

Short question: does your benchmark include the OCR time? 😅

Review thread on doctr/utils/lang_detect.py:

    The detected language in ISO 639 code and confidence score
    """

    model = LanguageDetectorBuilder.from_languages(*languages).build()
@felixdittrich92 (Contributor) commented

@aminemindee I think we should init the model outside of the function; otherwise it will be re-initialized on every call.
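The suggested fix can be sketched with a module-level cache. `build_detector` below is a hypothetical stand-in for lingua's `LanguageDetectorBuilder.from_languages(...).build()` call:

```python
from functools import lru_cache

def build_detector():
    """Stand-in for the costly construction, e.g. loading n-gram tables."""
    return {"languages": ["en", "fr", "de"]}  # placeholder model object

@lru_cache(maxsize=1)
def get_detector():
    """Build the detector once per process; later calls reuse the cached one."""
    return build_detector()

# Both calls return the very same object: the model is built only once.
assert get_detector() is get_detector()
```

A module-level constant would work too; the cached accessor just keeps the expensive construction lazy until the feature is actually used.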

@felixdittrich92 (Contributor) commented Aug 23, 2022

I think this also takes some time, so maybe an overall speed-up? The open question is how much RAM it needs!?

@felixdittrich92 (Contributor) commented Aug 23, 2022

I also saw that there is a with_low_accuracy_mode method on the detector builder; could we test whether it is enough for our use case?

We need to be careful about latency drops and high memory consumption (yes, I know it is optional currently, but I think it makes sense to check this) :)

@aminemindee (Contributor, Author) commented

Done and pushed.
we gain 0.3 s 👍

| model | timeit value | start - end time |
| --- | --- | --- |
| None | 12.6 s ± 25.1 ms | 12.55 s |
| fasttext | 12.5 s ± 149 ms | 12.79 s |
| langdetect | 12.3 s ± 96.5 ms | 12.57 s |
| lingua-py | 13.1 s ± 100 ms | 13.28 s |
| lingua-py (selected languages) | 12 s ± 86.1 ms | 12.22 s |

@aminemindee (Contributor, Author) commented

Can you tell me where with_low_accuracy_mode is? I don't seem to find it.

@aminemindee (Contributor, Author) commented

For the RAM: using only the nine languages in vocabs, it uses 0.7 GB of RAM.
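The measurement method isn't stated here. One stdlib way to bound the Python-level allocations of loading a model is `tracemalloc` (note it does not see native allocations made by C extensions such as fasttext, so it can undercount); this sketch uses a dummy list as a stand-in for the model's tables:

```python
import tracemalloc

tracemalloc.start()
# Stand-in for loading a language model's n-gram tables into memory.
model_tables = [("trigram%d" % i, i) for i in range(200_000)]
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"peak Python allocations: {peak / 1024 / 1024:.1f} MiB")
```

For whole-process figures (including native memory), an external tool such as the OS resident-set-size reading is the more reliable source.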

@felixdittrich92 (Contributor) commented

Puh... then we should test whether with_low_accuracy_mode is enough.

@aminemindee (Contributor, Author) commented Aug 23, 2022

With with_low_accuracy_mode, the sentence 'this is a sentence.' gets predicted as Portuguese x) The RAM goes down to 0.3 GB.

langdetect also uses 0.3 GB of RAM, and it uses an AI-based approach, meaning the confidence score is more meaningful as a probability rather than a distance.

@felixdittrich92 (Contributor) commented

Mh, I would say that 0.8 GB is too much... So, given your tests, do you think it is OK to switch to langdetect? Let's try this, and then let me know your decision; I'll go with it :)

@aminemindee (Contributor, Author) commented

From what I tested:

  • fasttext is the best one performance-wise, but you have to download a model, and it's the largest one.
  • langdetect is the easiest one to use, but it's not very polished (you have to wrap it in a try/except because it raises an error if the text contains no characters it recognizes).
  • lingua seems really good, but it uses too much RAM, and its performance is not that good in low-accuracy mode.

So for now I think langdetect is the reasonable choice, knowing we can change it easily later.
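The try/except guard described above can be sketched as follows. `detect` here is a stub standing in for `langdetect.detect`, which raises an exception (`LangDetectException`) on input with no recognizable characters:

```python
def detect(text: str) -> str:
    """Stub for langdetect.detect, which raises on undetectable input."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")  # langdetect: LangDetectException
    return "en"  # a real detector would return the detected ISO 639 code

def safe_detect(text: str) -> str:
    """Wrap detection so garbage input degrades to 'unknown' instead of crashing."""
    try:
        return detect(text)
    except Exception:
        return "unknown"

print(safe_detect("hello world"))  # -> en
print(safe_detect("1234 !!!"))     # -> unknown
```

Returning a sentinel such as "unknown" keeps the OCR pipeline running on pages of digits or symbols, where raising would abort the whole document.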

@felixdittrich92 felixdittrich92 (Contributor) left a review

@aminemindee thanks for all the updates 🤗

@felixdittrich92 felixdittrich92 merged commit faf5590 into mindee:main Aug 23, 2022
@felixdittrich92 felixdittrich92 mentioned this pull request Sep 26, 2022