Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datasets][PoC] Enable dataset usage for recognition task #867

Closed
wants to merge 32 commits into from

Conversation

felixdittrich92
Copy link
Contributor

@felixdittrich92 felixdittrich92 commented Mar 23, 2022

This PR is handled as Proof of Concept for further discussions to enable the ability to use existing datasets also for recognition task ( main goal: benchmarks ).
It's easier to show the idea directly in code instead of opening a discussion.

Things to investigate if the concept should be fine:

  • Synthtext cropping (if use_polygons=True) is to slow (multiprocessing ? maybe in another PR
    reminder: maybe a good reference)

@fg-mindee
No worry we can split this later in parts (maybe geometry, torch, tf) for review if you want 😅

Issue:
#855 First task of this
A good documentation would be part two
(ATTENTION: reminder for docs: SROIE & SVT does only provide uppercase labels and does not match the 'case-sensitive' in images)

Any feedback is very welcome 👍
@charlesmindee @SiddhantBahuguna @fg-mindee

@codecov
Copy link

codecov bot commented Mar 23, 2022

Codecov Report

Merging #867 (31de3ae) into main (9d03085) will increase coverage by 0.11%.
The diff coverage is 98.36%.

@@            Coverage Diff             @@
##             main     #867      +/-   ##
==========================================
+ Coverage   94.82%   94.94%   +0.11%     
==========================================
  Files         133      133              
  Lines        5200     5358     +158     
==========================================
+ Hits         4931     5087     +156     
- Misses        269      271       +2     
Flag Coverage Δ
unittests 94.94% <98.36%> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
doctr/datasets/utils.py 94.44% <90.00%> (-0.80%) ⬇️
doctr/datasets/synthtext.py 94.73% <92.59%> (-2.41%) ⬇️
doctr/datasets/cord.py 97.72% <100.00%> (+0.29%) ⬆️
doctr/datasets/datasets/pytorch.py 100.00% <100.00%> (ø)
doctr/datasets/datasets/tensorflow.py 100.00% <100.00%> (ø)
doctr/datasets/funsd.py 97.36% <100.00%> (+0.39%) ⬆️
doctr/datasets/ic03.py 97.72% <100.00%> (+0.35%) ⬆️
doctr/datasets/ic13.py 96.77% <100.00%> (+0.62%) ⬆️
doctr/datasets/iiit5k.py 96.96% <100.00%> (+0.19%) ⬆️
doctr/datasets/imgur5k.py 93.33% <100.00%> (+0.83%) ⬆️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9d03085...31de3ae. Read the comment docs.

@felixdittrich92 felixdittrich92 changed the title [WIP][PoC] Improve dataset usage for recognition task [WIP][PoC] Enable dataset usage for recognition task Mar 23, 2022
@felixdittrich92 felixdittrich92 changed the title [WIP][PoC] Enable dataset usage for recognition task [PoC] Enable dataset usage for recognition task Mar 24, 2022
@felixdittrich92 felixdittrich92 changed the title [PoC] Enable dataset usage for recognition task [datasets][PoC] Enable dataset usage for recognition task Mar 24, 2022
@felixdittrich92
Copy link
Contributor Author

felixdittrich92 commented Mar 30, 2022

@fg-mindee
I saw we have already extract_crops extract_rcrops in models/_utils.py can we maybe move this to geometry ?
provided the PoC is a match 😅

@felixdittrich92
Copy link
Contributor Author

@frgfm
Same here any feedback ? 😄

@frgfm
Copy link
Collaborator

frgfm commented Apr 6, 2022

Hey there 🙂

So I had thought about this a few months back. To make sure we are all on the same page, the goal is to:

  • take OCR annotated datasets
  • make a text recognition dataset out of it

Correct?

If so, two major options arise:

  • doing this dynamically (at it getitem)
  • doing this statically in the constructor

If this is for training, I'd argue the second option is the one for a few reasons :

  • latency
  • memory consumption (opening an image 100 times as big to crop it, in a number of samples that isn't necessarily the batch size comes with heavy consequences)

So perhaps this could be done in a temporary or cache folder 🤷‍♂️

What do you think?

@felixdittrich92
Copy link
Contributor Author

felixdittrich92 commented Apr 6, 2022

@frgfm
Yes you are right that's the goal (for all currently implemented datasets - without the obj detection one) :)
I have had this ways also in mind so currently i have implemented the 'constructor' option (which works really good) with one problem .... SynthText 😓 it is much to slow and needs a lot of memory (really much ~30gb ram) ...but thanks for your opinon now i can iterate on this and find a way to make it better performing 👍
I was also on track to think about saving for faster reload but it's no option it would waste a lot of the users space
One last thing for the moment are you fine if i move extract_rcrops / extract_crops to geometry otherwise we will have duplicated stuff !? :)
'
EDIT: SynthText also fixed thats the only dataset we need to store as pickle file inside the .cache/doctr/datasets/SynthText so the reloading is also much faster 👍 the other datasets are fine from latency and memory consumption without storing
#889 does the first part
Part2 than datasets + tests

Offtopic: Im really a bit hyped to train the first models on SynthText and MJSynth when we are done with it 😅

@felixdittrich92
Copy link
Contributor Author

I think it's mostly done i will split it into 2 PRs for easier reviews 👍

Copy link
Contributor

@fharper fharper left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix the issues found by Codacy & CodeFactor please?

@felixdittrich92
Copy link
Contributor Author

@fharper that' s only a PoC PR will be split into 2 PRs for review 👍
Part 1: #889

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants