Handle image and line regions in output formats ALTO, hOCR and text #3723

Merged: 1 commit, Feb 10, 2022

Member

@stweil stweil commented Jan 16, 2022

Signed-off-by: Stefan Weil <sw@weilnetz.de>

@stweil
Member Author

stweil commented Jan 16, 2022

The size of the output files is significantly reduced by these changes (sizes in MiB for OCR result from 41998 images):

# old code
19489	alto
16838	hocr
844	txt

# new code
18205	alto
15264	hocr
827	txt
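
For reference, the relative savings can be worked out from the numbers above. This is only a small Python sketch for the arithmetic, not part of the pull request; the helper name `reduction_percent` is made up for illustration.

```python
# Sizes in MiB, copied from the listing above.
old = {"alto": 19489, "hocr": 16838, "txt": 844}
new = {"alto": 18205, "hocr": 15264, "txt": 827}


def reduction_percent(before, after):
    """Return the relative size reduction in percent."""
    return 100.0 * (before - after) / before


reductions = {fmt: reduction_percent(old[fmt], new[fmt]) for fmt in old}

for fmt, pct in reductions.items():
    print(f"{fmt}: {old[fmt] - new[fmt]} MiB saved ({pct:.1f} %)")
```

This gives roughly 6.6 % for ALTO, 9.3 % for hOCR and 2.0 % for plain text.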

@Shreeshrii
Collaborator

Shreeshrii commented Jan 16, 2022 via email

@stweil
Member Author

stweil commented Jan 16, 2022

Will adding a placeholder in the text output for images and lines be helpful?

Personally I don't need such a placeholder. If I want image information, I use ALTO or hOCR.

But if there are good reasons for a placeholder, it could be implemented as an optional parameter which defines the placeholder text. The default would be an empty string which would not be written. Or the implementation could use a boolean parameter which would be set to true to write a fixed placeholder.
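
To make the first variant concrete, here is a rough Python sketch of the idea. It is not Tesseract code: the function `render_text`, the `placeholder` parameter and the `(kind, text)` region pairs are all hypothetical and only illustrate how an empty default would suppress the placeholder.

```python
def render_text(regions, placeholder=""):
    """Join recognized text regions; optionally mark non-text regions.

    `regions` is a hypothetical list of (kind, text) pairs, where kind is
    "text", "image" or "line". This is not Tesseract's data model.
    """
    parts = []
    for kind, text in regions:
        if kind == "text":
            parts.append(text)
        elif placeholder:
            # Image or line region: written only if a placeholder is set.
            parts.append(placeholder)
    return "\n".join(parts)


regions = [("text", "Hello"), ("image", ""), ("text", "World")]
print(render_text(regions))             # default: image region is skipped
print(render_text(regions, "[image]"))  # placeholder written for the image
```

With the empty default nothing is written for images and lines, so the text output stays the same as without the option.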

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@MerlijnWajer
Contributor

I'd say a placeholder would be confusing; it's too easy to mistake it for actual text.

@MerlijnWajer
Contributor

I have tried this code on a few books and it seems to work nicely.

In the meantime Aram developed a rudimentary method to create epubs from the hOCR output (and the input images), which can read ocr_photo tags and add them to the epub. This is IMHO a good way to validate that the hOCR ocr_photo elements also make sense.

You can find one such epub here: https://archive.org/download/sim_english-illustrated-magazine_1884-12_2_15/sim_english-illustrated-magazine_1884-12_2_15.epub
(Other files, like the images or the hOCR file, can be found in the directory index: https://archive.org/download/sim_english-illustrated-magazine_1884-12_2_15)

The ocr_photo boxes in this particular epub are not that great (still it does a decent job if you disregard the page edges that it finds as photos), but that is because the input images were not cropped to the pages, so we can't really blame Tesseract for that.
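
For anyone who wants to check the ocr_photo regions without building an epub, listing their bounding boxes is a quick alternative. The following is a minimal Python sketch using only the standard library; it assumes the bounding box is stored as a `bbox x0 y0 x1 y1` property in the `title` attribute, as the hOCR specification describes. The class name `PhotoBoxParser` and the sample markup are made up for illustration and are not taken from actual Tesseract output.

```python
from html.parser import HTMLParser


class PhotoBoxParser(HTMLParser):
    """Collect the bounding boxes of all elements with class ocr_photo."""

    def __init__(self):
        super().__init__()
        self.boxes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_photo" not in classes:
            return
        # The title holds semicolon-separated properties,
        # for example "bbox 100 120 900 700".
        for prop in (attrs.get("title") or "").split(";"):
            fields = prop.split()
            if len(fields) == 5 and fields[0] == "bbox":
                self.boxes.append(tuple(int(v) for v in fields[1:]))


# Hypothetical hOCR fragment, for illustration only.
sample = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
  <div class='ocr_photo' id='block_1_1' title='bbox 100 120 900 700'></div>
  <p class='ocr_par' title='bbox 100 750 900 1400'>Some text</p>
</div>
"""

parser = PhotoBoxParser()
parser.feed(sample)
print(parser.boxes)
```

Each tuple is (x0, y0, x1, y1) in pixels, so very large boxes that hug the page border (like the page edges mentioned above) are easy to spot.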

@stweil
Member Author

stweil commented Feb 9, 2022

What about this pull request? Can I merge it? I think it makes the Tesseract OCR output more useful, and I would like to include it in release 5.1.0.

Should TSV output ignore images and line separators? If yes, I can add a commit which does that, too.
Or should the TSV output include images and line separators? If yes, how should they be encoded?

@MerlijnWajer
Contributor

I would be happy if this makes it into a 5.1.0 release, but I suppose you weren't asking me. In any case, the code seems to work well for me, so you could add my Signed-off-by or Tested-by if that's helpful.

@stweil stweil merged commit 424b17f into tesseract-ocr:main Feb 10, 2022
@stweil stweil deleted the output-formats branch February 10, 2022 13:27
@stweil
Member Author

stweil commented Feb 10, 2022

Thank you. I added your Tested-by and merged the commit now.

As I don't know how the TSV output is used I won't change that for 5.1.0, but it can be changed any time later.
