[datasets][PoC] Enable dataset usage for recognition task #867

felixdittrich92 · 2022-03-23T09:58:53Z

This PR is a proof of concept to support further discussion on reusing the existing datasets for the recognition task as well (main goal: benchmarks).
It's easier to show the idea directly in code than to open a discussion.

Things to investigate if the concept is fine:

SynthText cropping (if use_polygons=True) is too slow (multiprocessing? maybe in another PR
reminder: maybe a good reference)

@fg-mindee
No worries, we can split this into parts later (maybe geometry, torch, tf) for review if you want 😅

Issue:
#855 (this PR is the first task of it)
Good documentation would be part two
(ATTENTION: reminder for docs: SROIE & SVT only provide uppercase labels, which do not match the actual casing of the text in the images)
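
To illustrate why the uppercase-only labels matter for benchmarks, here is a minimal sketch of an exact-match score with an optional case-insensitive mode. The function name `exact_match` and its `ignore_case` parameter are hypothetical and are not part of docTR; the sample strings are made up.

```python
from typing import Sequence


def exact_match(preds: Sequence[str], targets: Sequence[str], ignore_case: bool = False) -> float:
    """Share of predictions that are identical to their target string."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not targets:
        return 0.0
    if ignore_case:
        # Fold both sides to lowercase so casing differences are not counted as errors.
        preds = [p.lower() for p in preds]
        targets = [t.lower() for t in targets]
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)


# Made-up example: labels are uppercase-only, the model reads the casing seen in the image.
targets = ["TOTAL", "INVOICE"]
preds = ["Total", "Invoice"]
case_sensitive = exact_match(preds, targets)
case_insensitive = exact_match(preds, targets, ignore_case=True)
```

With labels like these, a case-sensitive score would penalize a model that reads the image correctly, so the docs could recommend a case-insensitive comparison for such datasets.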

Any feedback is very welcome 👍
@charlesmindee @SiddhantBahuguna @fg-mindee

codecov · 2022-03-23T10:08:05Z

Codecov Report

Merging #867 (31de3ae) into main (9d03085) will increase coverage by 0.11%.
The diff coverage is 98.36%.

@@            Coverage Diff             @@
##             main     #867      +/-   ##
==========================================
+ Coverage   94.82%   94.94%   +0.11%     
==========================================
  Files         133      133              
  Lines        5200     5358     +158     
==========================================
+ Hits         4931     5087     +156     
- Misses        269      271       +2

Flag	Coverage Δ
unittests	`94.94% <98.36%> (+0.11%)`	⬆️

Impacted Files	Coverage Δ
doctr/datasets/utils.py	`94.44% <90.00%> (-0.80%)`	⬇️
doctr/datasets/synthtext.py	`94.73% <92.59%> (-2.41%)`	⬇️
doctr/datasets/cord.py	`97.72% <100.00%> (+0.29%)`	⬆️
doctr/datasets/datasets/pytorch.py	`100.00% <100.00%> (ø)`
doctr/datasets/datasets/tensorflow.py	`100.00% <100.00%> (ø)`
doctr/datasets/funsd.py	`97.36% <100.00%> (+0.39%)`	⬆️
doctr/datasets/ic03.py	`97.72% <100.00%> (+0.35%)`	⬆️
doctr/datasets/ic13.py	`96.77% <100.00%> (+0.62%)`	⬆️
doctr/datasets/iiit5k.py	`96.96% <100.00%> (+0.19%)`	⬆️
doctr/datasets/imgur5k.py	`93.33% <100.00%> (+0.83%)`	⬆️
... and 18 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 9d03085...31de3ae.

felixdittrich92 · 2022-03-30T09:52:57Z

@fg-mindee
I saw that we already have extract_crops and extract_rcrops in models/_utils.py. Can we maybe move these to geometry?
Provided the PoC is a match 😅

felixdittrich92 · 2022-04-05T21:19:28Z

@frgfm
Same here, any feedback? 😄

frgfm · 2022-04-06T17:32:44Z

Hey there 🙂

So I had thought about this a few months back. To make sure we are all on the same page, the goal is to:

- take OCR-annotated datasets
- make a text recognition dataset out of them

Correct?

If so, two major options arise:

1. doing this dynamically (in `__getitem__`)
2. doing this statically in the constructor

If this is for training, I'd argue the second option is the one to pick, for a few reasons:

- latency
- memory consumption (opening an image 100 times as large just to crop it, for a number of samples that isn't necessarily the batch size, comes with heavy consequences)

So perhaps this could be done in a temporary or cache folder 🤷‍♂️

What do you think?
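
The two options above can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the code of this PR: pages are toy nested lists instead of real images, and the names `crop`, `DynamicCrops`, `StaticCrops` and `load` are hypothetical.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]                     # toy grayscale page: rows of pixel values
Box = Tuple[int, int, int, int]             # (xmin, ymin, xmax, ymax), absolute pixels
Sample = Tuple[str, List[Tuple[Box, str]]]  # (page id, [(word box, word text), ...])


def crop(img: Image, box: Box) -> Image:
    """Cut an axis-aligned box out of a page."""
    xmin, ymin, xmax, ymax = box
    return [row[xmin:xmax] for row in img[ymin:ymax]]


class DynamicCrops:
    """Option 1: store only an index and crop inside __getitem__."""

    def __init__(self, samples: List[Sample], load: Callable[[str], Image]) -> None:
        self.load = load
        self.index = [(pid, box, word) for pid, words in samples for box, word in words]

    def __len__(self) -> int:
        return len(self.index)

    def __getitem__(self, idx: int) -> Tuple[Image, str]:
        pid, box, word = self.index[idx]
        return crop(self.load(pid), box), word  # the full page is reopened on every access


class StaticCrops:
    """Option 2: open each page once in the constructor and keep only the crops."""

    def __init__(self, samples: List[Sample], load: Callable[[str], Image]) -> None:
        self.data: List[Tuple[Image, str]] = []
        for pid, words in samples:
            page = load(pid)  # one load per page, however many words it holds
            self.data.extend((crop(page, box), word) for box, word in words)

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> Tuple[Image, str]:
        return self.data[idx]


# Toy data: one 4x6 page with two annotated words.
pages = {"page0": [[10 * r + c for c in range(6)] for r in range(4)]}
samples = [("page0", [((0, 0, 2, 2), "ab"), ((2, 1, 6, 3), "cdef")])]
loads: List[str] = []


def load(pid: str) -> Image:
    loads.append(pid)  # record every page load to compare both options
    return pages[pid]


static = StaticCrops(samples, load)
dynamic = DynamicCrops(samples, load)
```

Both classes return the same `(crop, word)` pairs. The difference is where the cost lands: the dynamic variant pays one full page load per sample at access time, while the static variant pays one load per page up front and then holds all crops in memory.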

felixdittrich92 · 2022-04-06T19:50:18Z

@frgfm
Yes, you are right, that's the goal (for all currently implemented datasets, except the object detection one) :)
I also had these two ways in mind, so currently I have implemented the 'constructor' option (which works really well), with one problem... SynthText 😓 It is much too slow and needs a lot of memory (~30 GB of RAM)... but thanks for your opinion, now I can iterate on this and find a way to make it perform better 👍
I was also thinking about saving the crops for faster reloading, but that's not an option: it would waste a lot of the user's disk space.
One last thing for the moment: are you fine with me moving extract_rcrops / extract_crops to geometry? Otherwise we will have duplicated code :)

EDIT: SynthText is fixed as well. It's the only dataset we need to store as a pickle file (inside .cache/doctr/datasets/SynthText), so reloading is also much faster 👍 The other datasets are fine in terms of latency and memory consumption without storing anything.
#889 does the first part
Part 2 then covers datasets + tests
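
The pickle-based reload mentioned in the EDIT can be sketched roughly like this. It is an assumption-laden illustration, not the code from #889 or #891: the helper `load_or_build`, the file name `crops.pkl` and the placeholder crop data are all hypothetical, and a temporary directory stands in for the real cache folder.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, List, Tuple


def load_or_build(cache_file: Path, build: Callable[[], Any]) -> Any:
    """Return the cached object if the pickle exists, else build it and cache it."""
    if cache_file.exists():
        # Only unpickle files you created yourself: pickle can execute code on load.
        with cache_file.open("rb") as f:
            return pickle.load(f)
    data = build()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as f:
        pickle.dump(data, f)
    return data


calls: List[int] = []


def build_crops() -> List[Tuple[str, str]]:
    """Stand-in for the slow cropping step; returns placeholder (crop, label) pairs."""
    calls.append(1)
    return [("crop-0", "hello"), ("crop-1", "world")]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "SynthText" / "crops.pkl"  # hypothetical file name
    first = load_or_build(cache_file, build_crops)   # builds the crops and writes the pickle
    second = load_or_build(cache_file, build_crops)  # reads the pickle, skips the build
```

The trade-off is the one discussed above: the expensive cropping runs once, at the cost of disk space for the cached file.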

Off-topic: I'm really a bit hyped to train the first models on SynthText and MJSynth when we are done with this 😅

felixdittrich92 · 2022-04-07T20:12:44Z

I think it's mostly done. I will split it into 2 PRs for easier review 👍

fharper

Can you fix the issues found by Codacy & CodeFactor please?

felixdittrich92 · 2022-04-10T08:57:45Z

@fharper that's only a PoC PR; it will be split into 2 PRs for review 👍
Part 1: #889

felixdittrich92 added 22 commits January 11, 2022 08:34

backup

81c313e

Merge branch 'mindee:main' into main

50574b5

Merge branch 'mindee:main' into main

5a6ed54

Merge branch 'mindee:main' into main

b9958a7

Merge branch 'mindee:main' into main

14c4651

Merge branch 'mindee:main' into main

779731f

Merge branch 'mindee:main' into main

ce2cdda

Merge branch 'mindee:main' into main

d13dc43

Merge branch 'mindee:main' into main

9a07d73

Merge branch 'mindee:main' into main

a002a70

Merge branch 'mindee:main' into main

6ad096e

Merge branch 'mindee:main' into main

1e77fd4

Merge branch 'mindee:main' into main

2be762c

Merge branch 'mindee:main' into main

e2f2055

Merge branch 'mindee:main' into main

bdc4e67

Merge branch 'mindee:main' into main

b525021

Merge branch 'mindee:main' into main

417a27b

Merge branch 'mindee:main' into main

9b3f5a1

Merge branch 'mindee:main' into main

93074a8

Merge branch 'mindee:main' into main

c64e209

start crop-pre

041156d

first PoC

687b53a

felixdittrich92 changed the title ~~[WIP][PoC] Improve dataset usage for recognition task~~ [WIP][PoC] Enable dataset usage for recognition task Mar 23, 2022

felixdittrich92 added 2 commits March 23, 2022 12:55

minor fixes

b7bb062

update

863e885

felixdittrich92 changed the title ~~[WIP][PoC] Enable dataset usage for recognition task~~ [PoC] Enable dataset usage for recognition task Mar 24, 2022

felixdittrich92 changed the title ~~[PoC] Enable dataset usage for recognition task~~ [datasets][PoC] Enable dataset usage for recognition task Mar 24, 2022

add missing module export

9b2200d

felixdittrich92 force-pushed the crop-pre branch from 6b79301 to 9b2200d Compare April 5, 2022 20:11

felixdittrich92 force-pushed the crop-pre branch from d3ab5ab to 9b2200d Compare April 7, 2022 06:36

felixdittrich92 added 7 commits April 7, 2022 11:30

move cropping to geometry

17af845

up

9ba303d

isort

993f1a9

fix synthtext performance

44e048b

up

ba4af0d

up

5d50fb6

up

31de3ae

felixdittrich92 mentioned this pull request Apr 7, 2022

[refactor][fix]: Part1 from use datasets for recognition task #889

Merged

fharper suggested changes Apr 8, 2022

felixdittrich92 marked this pull request as draft April 10, 2022 08:58

felixdittrich92 mentioned this pull request Apr 13, 2022

[feature] Part 2 from use datasets for recognition #891

Merged

charlesmindee closed this in #891 Apr 27, 2022

felixdittrich92 deleted the crop-pre branch April 27, 2022 19:40
