Feat/add language detection #1023
Conversation
Codecov Report

    @@           Coverage Diff           @@
    ##             main    #1023   +/-   ##
    =======================================
      Coverage   94.91%   94.91%
    =======================================
      Files         135      135
      Lines        5570     5590   +20
    =======================================
    + Hits         5287     5306   +19
    - Misses        283      284    +1

Flags with carried forward coverage won't be shown.
Hi @aminemindee 👋, could you maybe add a short latency benchmark with language detection and without, as a comparison? I agree that it is a great feature (which should be inside doctr), but I see fasttext as a bit troublesome. Have you tested:
About docker:
Hello @felixdittrich92, when you say a benchmark with and without language detection, what shape do you have in mind? For the Docker issue, you're right: apparently fasttext is the one making it fail, because it needs C++11.
Hi @aminemindee, yeah, I'm fine if you test it locally on your side and share the time it took with and without language detection, maybe for 2-3 test documents :) Sounds good to me (from skimming the lib, it only uses n-grams, so it should also be much faster than fasttext) 👍
Hello @felixdittrich92, I did a benchmark with fasttext, langdetect and lingua-py on 3 PDFs with a total of 24 pages.
It's weird that it is sometimes faster when using a language detection model (probably the overhead is insignificant compared to the rest). lingua-py becomes slower the more languages you tell it to handle (in this benchmark I used all available languages). Also something to take into consideration: in their documentation there is this:
Hi @aminemindee, I think if we can choose, it is enough to support language detection for the languages where we also support vocabs (see datasets/vocabs). Was lingua-py accurate?
Hello @felixdittrich92, if I understand correctly, you suggest to only include the languages that are present in the vocabs. Regarding the performance of lingua-py: for the documents I tested, it got the right answer with probabilities of 1.0.
@aminemindee yes, that's what I mean :) prob sounds also good 😅 (we could also grab the vocabs via an enum, so that language detection is updated if we add a new vocab 👍)
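To illustrate the enum idea, a minimal sketch (not code from this PR; the VOCABS import path and the name matching against lingua's Language enum are assumptions):

```python
# Hypothetical sketch: derive the detectable languages from the vocabs doctr
# already ships, so adding a new vocab automatically extends language detection.
from doctr.datasets.vocabs import VOCABS
from lingua import Language, LanguageDetectorBuilder


def build_detector_from_vocabs():
    # Keep only vocab names that match a lingua Language member, e.g. "french" -> Language.FRENCH
    languages = [
        Language[name.upper()]
        for name in VOCABS
        if name.upper() in Language.__members__
    ]
    return LanguageDetectorBuilder.from_languages(*languages).build()
```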
Yep, really good idea. I tried to do that; tell me if it's possible to do it in a cleaner way.
Hi @aminemindee 👋,
thanks for updating this 👍
I have added some minor comments :)
Would be great if you could also add the benchmark comparing lingua with all languages vs. only those which are available in vocabs 💯
For the benchmark:
Mh... I thought we could speed it up a bit more, but it should be ok for now :) Short question: does your benchmark include OCR times? 😅
doctr/utils/lang_detect.py (outdated)

        The detected language in ISO 639 code and confidence score
    """
    model = LanguageDetectorBuilder.from_languages(*languages).build()
@aminemindee I think we should init the model outside of the function, otherwise it will be initialized again on every call.
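A minimal sketch of that suggestion, assuming the lingua-based helper; the language set and function name are illustrative, not the PR's actual code:

```python
# Hypothetical sketch: build the (expensive) detector once at import time and
# reuse it on every call, instead of rebuilding it inside the function.
from lingua import Language, LanguageDetectorBuilder

# Built once when the module is imported, then shared by all calls.
_DETECTOR = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH
).build()


def detect_language(text: str) -> str:
    """Return the detected ISO 639-1 code for ``text``, or "unknown"."""
    language = _DETECTOR.detect_language_of(text)
    return language.iso_code_639_1.name.lower() if language else "unknown"
```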
I think this also takes some time, so maybe an overall speed-up? There would be the question of how much RAM it needs!?
I also saw that there is a with_low_accuracy_mode method on the DetectorBuilder; could we test whether it is maybe enough for our use case?
We need to be careful with latency drops and high memory consumption (yes, I know it is optional currently, but I think it makes sense to check this) :)
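For reference, a short sketch of how the low-accuracy mode is switched on in lingua-py (it trades accuracy on short texts for smaller in-memory models; the language set here is only an example):

```python
# Sketch: enable lingua-py's low-accuracy mode to reduce memory usage.
from lingua import Language, LanguageDetectorBuilder

detector = (
    LanguageDetectorBuilder.from_languages(Language.ENGLISH, Language.FRENCH)
    .with_low_accuracy_mode()
    .build()
)
print(detector.detect_language_of("this is a sentence."))
```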
Done and pushed.
we gain 0.3 s 👍
model | timeit value | start - end time
---|---|---
None | 12.6 s ± 25.1 ms | 12.55 s
fasttext | 12.5 s ± 149 ms | 12.79 s
langdetect | 12.3 s ± 96.5 ms | 12.57 s
lingua-py | 13.1 s ± 100 ms | 13.28 s
lingua-py (selected languages) | 12 s ± 86.1 ms | 12.22 s
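For context, the with/without comparison could be reproduced with something along these lines (a sketch only: the file names are placeholders and langdetect stands in for whichever library is under test, not the exact script used for the numbers above):

```python
# Rough benchmark sketch for timing OCR with and without language detection.
import timeit

from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from langdetect import detect  # swap in fasttext / lingua-py to compare libraries

predictor = ocr_predictor(pretrained=True)
docs = [DocumentFile.from_pdf(path) for path in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]]


def run(with_lang_detect: bool) -> None:
    for pages in docs:
        result = predictor(pages)
        if with_lang_detect:
            detect(result.render())  # language detection on the extracted text


print("without:", timeit.timeit(lambda: run(False), number=3))
print("with   :", timeit.timeit(lambda: run(True), number=3))
```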
Can you tell me where the with_low_accuracy_mode is? I don't seem to find it.
For the RAM: using only the nine languages in vocabs, it uses 0.7 GB.
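As a side note, such a RAM figure can be sanity-checked with a quick resident-memory probe like the following (a sketch; psutil is an extra dependency and the language set is only an example):

```python
# Sketch: measure process RSS before and after building and warming up the detector.
import os

import psutil
from lingua import Language, LanguageDetectorBuilder

process = psutil.Process(os.getpid())
before = process.memory_info().rss

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.FRENCH, Language.GERMAN
).build()
detector.detect_language_of("warm up")  # models are loaded lazily on first use

after = process.memory_info().rss
print(f"detector memory: {(after - before) / 1024 ** 3:.2f} GB")
```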
Puh... then we should test whether with_low_accuracy_mode is enough.
With with_low_accuracy_mode, the sentence 'this is a sentence.' gets predicted as Portuguese x) The RAM goes down to 0.3 GB.
langdetect also uses 0.3 GB of RAM, and it uses an AI-based approach, meaning the confidence score is more meaningful, since it is a probability and not a distance.
Mh, I would say that 0.8 GB is too much... So given your tests, do you think it is ok to switch to langdetect? Let's try this, and then let me know your decision; I will go with it :)
From what I tested:
- fasttext performs best, but you have to download a model and it's the largest one.
- langdetect is the easiest one to use, but it's not really well done (you have to wrap it in a try/except because it raises an error if the text doesn't contain any character it knows); see the sketch below.
- lingua seems really good, but it uses too much RAM, and its performance is not that good with the low-accuracy mode.

So for now I think langdetect is the reasonable choice, knowing we can change it easily afterwards.
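A minimal sketch of that try/except pattern around langdetect (the helper name is illustrative, not necessarily what the PR ends up with):

```python
# Sketch: wrap langdetect defensively, since it raises on text it cannot handle.
from typing import Tuple

from langdetect import DetectorFactory, detect_langs
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs


def get_language(text: str) -> Tuple[str, float]:
    """Return (ISO 639-1 code, probability), or ("unknown", 0.0) on failure."""
    try:
        candidates = detect_langs(text)  # sorted by decreasing probability
    except LangDetectException:
        # Raised e.g. when the text contains no characters langdetect recognises
        return "unknown", 0.0
    best = candidates[0]
    return best.lang, best.prob
```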
@aminemindee thanks for all the updates 🤗
Hello, this PR deals with the problem discussed here: #973, regarding the language detection part. (It may need a rebase if the one about orientation is merged first.)