Handle image and line regions in output formats ALTO, hOCR and text #3723

stweil · 2022-01-16T13:06:30Z

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil · 2022-01-16T13:13:10Z

The size of the output files is significantly reduced by these changes (sizes in MiB for the OCR results from 41998 images):

# old code
19489	alto
16838	hocr
844	txt

# new code
18205	alto
15264	hocr
827	txt
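
For reference, the relative savings implied by these figures can be worked out with a few lines of Python. This is only a sketch over the numbers quoted above; the variable and function names are made up for illustration and are not part of Tesseract.

```python
# Sizes in MiB, copied from the old/new comparison above.
old_sizes = {"alto": 19489, "hocr": 16838, "txt": 844}
new_sizes = {"alto": 18205, "hocr": 15264, "txt": 827}


def reduction_percent(old: int, new: int) -> float:
    """Relative size reduction in percent."""
    return 100.0 * (old - new) / old


reductions = {
    fmt: reduction_percent(old_sizes[fmt], new_sizes[fmt]) for fmt in old_sizes
}

for fmt, pct in reductions.items():
    saved = old_sizes[fmt] - new_sizes[fmt]
    print(f"{fmt}: {saved} MiB saved ({pct:.1f} %)")
```

With the quoted sizes this gives roughly 6.6 % for ALTO, 9.3 % for hOCR and 2.0 % for text.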

Shreeshrii · 2022-01-16T13:17:18Z

// Ignore images and lines for text output.

Will adding a placeholder in the text output for images and lines be helpful?

stweil · 2022-01-16T13:32:46Z

> Will adding a placeholder in the text output for images and lines be helpful?

Personally I don't need such a placeholder. If I want image information, I use ALTO or hOCR.

But if there are good reasons for a placeholder, it could be implemented as an optional parameter which defines the placeholder text. The default would be an empty string which would not be written. Or the implementation could use a boolean parameter which would be set to true to write a fixed placeholder.
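
To make the first variant concrete, here is a minimal sketch of an optional placeholder parameter whose default is an empty string that is never written. It is written in Python for brevity and is not Tesseract code; `Block`, `render_text` and the block kinds are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str       # "text", "image" or "line" (hypothetical labels)
    text: str = ""  # recognized text, only meaningful for text blocks


def render_text(blocks, placeholder: str = "") -> str:
    """Join the text blocks of a page.

    Image and line blocks are skipped unless a non-empty placeholder
    is given, in which case the placeholder is written in their place.
    """
    parts = []
    for block in blocks:
        if block.kind == "text":
            parts.append(block.text)
        elif placeholder:
            parts.append(placeholder)
    return "\n".join(parts)


page = [
    Block("text", "First paragraph."),
    Block("image"),
    Block("text", "Second paragraph."),
]

print(render_text(page))                                   # image is ignored
print(render_text(page, placeholder="[non-text region]"))  # image is marked
```

The boolean variant would differ only in that the placeholder text is fixed and the parameter switches it on or off.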

Signed-off-by: Stefan Weil <sw@weilnetz.de>

MerlijnWajer · 2022-02-02T14:04:41Z

I'd say a placeholder would be confusing; it's too easy to mistake it for actual text.

MerlijnWajer · 2022-02-02T20:14:39Z

I have tried this code on a few books and it seems to work nicely.

In the meantime Aram developed a rudimentary method to create epubs from the hOCR output (and the input images), which can read ocr_photo tags and add them to the epub. This is IMHO a good way to validate that the hOCR ocr_photo elements also make sense.

You can find one such epub here: https://archive.org/download/sim_english-illustrated-magazine_1884-12_2_15/sim_english-illustrated-magazine_1884-12_2_15.epub
(Other files, like the images or the hOCR file, can be found in the directory index: https://archive.org/download/sim_english-illustrated-magazine_1884-12_2_15)

The ocr_photo boxes in this particular epub are not that great, but that is because the input images were not cropped to the pages, so we can't really blame Tesseract for that. Still, it does a decent job if you disregard the page edges that it detects as photos.
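
For anyone who wants to check the ocr_photo boxes themselves, here is a rough Python sketch that collects them from an hOCR document and drops the ones touching the page border. It is not the code used for the epub above; `PhotoBoxParser`, `touches_page_edge` and the sample markup are made up for illustration, and the markup is only loosely modelled on hOCR (class names plus a `bbox` property in the `title` attribute).

```python
from html.parser import HTMLParser


class PhotoBoxParser(HTMLParser):
    """Collect the page bbox and all ocr_photo bboxes from hOCR markup."""

    def __init__(self):
        super().__init__()
        self.page_box = None
        self.photo_boxes = []

    @staticmethod
    def _bbox(title):
        # Title properties are separated by ';', e.g. "bbox 0 0 10 20; ppageno 0".
        for prop in title.split(";"):
            fields = prop.split()
            if fields and fields[0] == "bbox":
                return tuple(int(v) for v in fields[1:5])
        return None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        box = self._bbox(attrs.get("title") or "")
        if box is None:
            return
        if "ocr_page" in classes:
            self.page_box = box
        elif "ocr_photo" in classes:
            self.photo_boxes.append(box)


def touches_page_edge(box, page):
    """True if the box reaches any border of the page bbox."""
    x0, y0, x1, y1 = box
    px0, py0, px1, py1 = page
    return x0 <= px0 or y0 <= py0 or x1 >= px1 or y1 >= py1


# Made-up sample: one "photo" along the left page edge, one real picture.
SAMPLE = """
<div class='ocr_page' id='page_1' title='image "page.png"; bbox 0 0 1000 1500; ppageno 0'>
  <div class='ocr_photo' id='block_1_1' title="bbox 0 0 40 1500"></div>
  <div class='ocr_photo' id='block_1_2' title="bbox 200 300 800 700"></div>
  <div class='ocr_carea' id='block_1_3' title="bbox 100 800 900 1400"></div>
</div>
"""

parser = PhotoBoxParser()
parser.feed(SAMPLE)

inner = [b for b in parser.photo_boxes if not touches_page_edge(b, parser.page_box)]
print("all photo boxes:  ", parser.photo_boxes)
print("without page edge:", inner)
```

Filtering on the page border is only a heuristic for uncropped scans; a photo that really extends to the edge of the page would be dropped as well.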

stweil · 2022-02-09T19:46:29Z

What about this pull request? Can I merge it? I think it makes the Tesseract OCR output more useful, and I would like to include it in release 5.1.0.

Should TSV output ignore images and line separators? If yes, I can add a commit which does that, too.
Or should the TSV output include images and line separators? If yes, how should they be encoded?

MerlijnWajer · 2022-02-10T12:31:30Z

I would be happy if this makes it into a 5.1.0 release, but I suppose you weren't asking me. In any case, the code seems to work well for me, so you could add my Signed-off-by or Tested-by if that's helpful.

stweil · 2022-02-10T13:31:20Z

Thank you. I added your Tested-by and merged the commit now.

As I don't know how the TSV output is used, I won't change that for 5.1.0, but it can be changed any time later.

stweil mentioned this pull request Jan 16, 2022

RFC: Add option to include images in hOCR output #3710

Closed

Handle image and line regions in output formats ALTO, hOCR and text

5787667

Signed-off-by: Stefan Weil <sw@weilnetz.de>

stweil force-pushed the output-formats branch from 837d29b to 5787667 Compare January 18, 2022 15:17

stweil merged commit 424b17f into tesseract-ocr:main Feb 10, 2022

stweil deleted the output-formats branch February 10, 2022 13:27

amitdo mentioned this pull request Nov 7, 2022

PDF renderer: Tesseract inserts spaces for non-text blocks it finds #3957

Closed
