Handle image and line regions in output formats ALTO, hOCR and text #3723
The size of the output files is significantly reduced by these changes (sizes in MiB for OCR result from 41998 images):
// Ignore images and lines for text output.
Will adding a placeholder in the text output for images and lines be helpful? |
Personally, I don't need such a placeholder. If I want image information, I use ALTO or hOCR. But if there are good reasons for a placeholder, it could be implemented as an optional parameter which defines the placeholder text. The default would be an empty string, in which case nothing would be written. Alternatively, the implementation could use a boolean parameter which would be set to true to write a fixed placeholder. |
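The optional-parameter variant described above could look roughly like the following. This is only a minimal sketch and not code from this pull request: `RegionKind`, `Region` and `RenderText` are hypothetical names made up for illustration and do not correspond to Tesseract's actual types or API. The only assumption is that each region is either text, an image or a line separator.

```cpp
#include <string>
#include <vector>

// Hypothetical region kinds for illustration; not Tesseract's actual types.
enum class RegionKind { Text, Image, Line };

struct Region {
  RegionKind kind;
  std::string text;  // Recognized text; empty for image and line regions.
};

// Renders the regions as plain text, one region per output line.
// Image and line regions are skipped unless a non-empty placeholder
// is passed, in which case the placeholder is written instead.
std::string RenderText(const std::vector<Region> &regions,
                       const std::string &placeholder = "") {
  std::string out;
  for (const Region &region : regions) {
    if (region.kind == RegionKind::Text) {
      out += region.text;
      out += '\n';
    } else if (!placeholder.empty()) {
      out += placeholder;
      out += '\n';
    }
  }
  return out;
}
```

With the default empty placeholder, the output contains only the text regions, which matches the behaviour of ignoring images and lines. Passing something like `"[image]"` would write that string once per non-text region.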
Signed-off-by: Stefan Weil <sw@weilnetz.de>
837d29b to 5787667
I'd say a placeholder would be confusing; it's too easy to mistake it for actual text. |
I have tried this code on a few books and it seems to work nicely. In the meantime Aram developed a rudimentary method to create epubs from the hOCR output (and the input images), which can read You can find one such epub here: https://archive.org/download/sim_english-illustrated-magazine_1884-12_2_15/sim_english-illustrated-magazine_1884-12_2_15.epub The ocr_photo boxes in this particular epub are not that great (it still does a decent job if you disregard the page edges that it detects as photos), but that is because the input images were not cropped to the pages, so we can't really blame Tesseract for that. |
What about this pull request? Can I merge it? I think it makes the Tesseract OCR output more useful, and I would like to include it in release 5.1.0. Should TSV output ignore images and line separators? If yes, I can add a commit which does that, too. |
I would be happy if this makes it into a 5.1.0 release, but I suppose you weren't asking me. In any case, the code seems to work well for me, so you could add my Signed-off-by or Tested-by if that's helpful. |
Thank you. I added your Tested-by and merged the commit now. As I don't know how the TSV output is used, I won't change that for 5.1.0, but it can be changed at any time later. |
Signed-off-by: Stefan Weil <sw@weilnetz.de>