RFC: Add option to include images in hOCR output #3710
Great that you addressed this highly desired feature!
I have some smaller comments. Please note that we plan to release 5.0.1 this week, and I would postpone reviewing and testing the new feature until that release is published.
Signed-off-by: Merlijn Wajer <merlijn@wizzup.org>
4b662b9 to dafe4ff
Thinking about this some more, I think I recall having some images where Tesseract thought the entire page was a photo, but also still found (legitimate) text on it. I'll find such an example -- if my recollection is correct then we definitely do want to add text elements inside the |
Here is one such image: https://archive.org/~merlijn/tesseract-images/hocr-images/sim_canadian-medical-association-journal_1963-03-16_88_11_0003.jpg There are also two files in that directory ( https://archive.org/~merlijn/tesseract-images/hocr-images/ ): "skipword.hocr" uses the code from this PR as of writing, and "noskipword.hocr" is when we keep It looks like Tesseract's iterator is clever enough to not place the majority of the text under the floating image, so that would argue for indeed skipping text contents in |
While the hocr spec is vague in this regard, I would still strongly argue against |
If I understand the code of the PR, it just writes an HTML element of type
I would like to have as many non-text areas classified and tagged as possible: tables, formulae, decorative elements (separators), drawings, copper engravings (whole page), woodcuts. Even if it's not very reliable, Tesseract's heuristic guess ("there is something, but it's not typical text") is helpful.
Ok, note that currently we only seem to output a few spaces for these areas, and nothing else. I think it depends mostly on how the Tesseract page iterator works. It might be that the current behaviour (before this PR) of outputting a
Right, there are other block types like |
AFAIR it depends on the combination of There are other cases which Tesseract does not seem to detect, like large, extremely letterspaced letters, or single glyphs like page numbers. Then Tesseract either outputs nothing (not a text area, case 1), or tries OCR without a recognition result (case 2) and outputs a word element with only a space in it.
I get no text for |
And it is even right, it is some kind of photo (or a scan), so the hOCR output may look like this:
|
In https://github.com/stweil/tesseract/tree/output-formats I have an experimental implementation which works better for me and which also handles ALTO and text output.
@MerlijnWajer, in pull request #3723 I handle image and line regions unconditionally. I tested it on a set of 41998 images. It produced the same text results, but correctly wrote image and line regions instead of text regions with words consisting of blanks only. Maybe you want to test that on a larger number of images.
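One way to run such a check over a batch of hOCR files is to look for ocrx_word elements whose text is only whitespace. The sketch below uses Python's standard html.parser. The class name BlankWordFinder is made up for illustration and is not part of Tesseract or any hOCR tool, and the code assumes that word elements contain no void tags such as br.

```python
from html.parser import HTMLParser


class BlankWordFinder(HTMLParser):
    """Collect the title attribute of every ocrx_word that holds only whitespace."""

    def __init__(self):
        super().__init__()
        self.blank_titles = []  # title attributes (bbox etc.) of blank-only words
        self._depth = 0         # nesting depth inside the current ocrx_word, 0 = outside
        self._title = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Already inside a word: only track nesting (e.g. <strong> or <em>).
            self._depth += 1
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if "ocrx_word" in classes:
            self._depth = 1
            self._title = attributes.get("title") or ""
            self._text = []

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if not self._depth:
            return
        self._depth -= 1
        if self._depth == 0 and not "".join(self._text).strip():
            self.blank_titles.append(self._title)


def find_blank_words(hocr_text):
    """Return the title attributes of all blank-only ocrx_word elements."""
    finder = BlankWordFinder()
    finder.feed(hocr_text)
    finder.close()
    return finder.blank_titles
```

Calling find_blank_words on the contents of each output file and comparing the counts before and after a change gives a rough measure of how many blank-only text regions remain.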
Looks good, thanks for picking this up and improving it. Not writing text regions with only blank words is, I think, another clear improvement. I will try to do a few test runs with the code in the coming week (and can provide a sign-off if useful), but don't block on me if you feel it's ready to be merged.
Meanwhile, more than two weeks have passed. Are there new results from testing this pull request and also #3723?
Yes, that's right.
#3723 was merged. @MerlijnWajer, thanks for your effort.
Regarding polygon, the API has |
OCR-D uses I think Tesseract's hOCR and ALTO output could be enhanced to use polygons, too. |
polygon for geometric hulls of areas and polyline for e.g. the baseline are more precise and flexible. We can always convert or interpolate to a bbox or a straight line, but not in the other direction. Best would be Bezier curves, which are supported by SVG, and ImageMagick understands the SVG syntax for Bezier curves. Potrace also returns SVG. But that's maybe overkill in 99% of the cases and the math is too complex for an average developer.
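The one-way nature of that conversion can be shown with a minimal Python sketch. The function name polygon_to_bbox and the sample shapes are illustrative only and do not come from Tesseract, OCR-D, or the hOCR specification.

```python
from typing import Iterable, Tuple

Point = Tuple[int, int]
BBox = Tuple[int, int, int, int]  # (left, top, right, bottom)


def polygon_to_bbox(points: Iterable[Point]) -> BBox:
    """Return the axis-aligned bounding box that encloses all polygon points."""
    pts = list(points)
    if not pts:
        raise ValueError("a polygon needs at least one point")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return (min(xs), min(ys), max(xs), max(ys))


# Two different outlines (hypothetical coordinates) ...
triangle = [(10, 60), (50, 5), (90, 60)]
rectangle = [(10, 5), (90, 5), (90, 60), (10, 60)]

# ... that collapse to the same box, so the outline cannot be recovered from it.
same_box = polygon_to_bbox(triangle) == polygon_to_bbox(rectangle)
```

Because both shapes map to the same box, an output format that stores only the bbox cannot be upgraded to polygons later without re-running layout analysis.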
Written in collaboration with Aram (see commit author).
This pull request adds support to output ocr_photo elements in the hOCR renderer. ocr_photo is "Something that requires JPEG or PNG to be represented well" per the specification. ocr_image would be for SVG content. Since that is hard to distinguish, let's just go with ocr_photo unconditionally.
There are a few open questions/concerns that I can think of:
ocrx_word elements inside ocr_photo? I think per the hOCR specification this is allowed, but Tesseract seems to just "find" a word of the same bounding box with nothing but spaces as characters, in our testing.
ocr_photo elements, should we skip generating the ocr_line when writing hOCR even if the hocr images option is turned off, since we know the element is detected as an image block type with (apparently) no actual content?
ocrx_word elements, is the goto acceptable, or would you rather see the code rewritten without the goto?
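To make the trade-off in these questions concrete, here is a language-neutral sketch in Python of one possible policy: image blocks become an empty ocr_photo element, and words consisting only of blanks are dropped from text blocks. This is not the code of this PR or of the Tesseract renderer. The block-type labels, the function name render_block, and the simplified markup (no ids, no ocr_line or ocr_par level, no word bounding boxes) are all assumptions made for illustration.

```python
from html import escape

# Hypothetical labels standing in for the image block types of the layout analysis.
IMAGE_BLOCK_TYPES = {"flowing_image", "heading_image", "pullout_image"}


def render_block(block_type, bbox, words):
    """Render one block as a simplified hOCR fragment.

    block_type: one of the hypothetical labels above, or anything else for text
    bbox:       (left, top, right, bottom) in pixels
    words:      recognized word strings for the block
    """
    left, top, right, bottom = bbox
    title = f"bbox {left} {top} {right} {bottom}"
    if block_type in IMAGE_BLOCK_TYPES:
        # Image block: emit the region only and skip any words found inside it.
        return f"<div class='ocr_photo' title='{title}'></div>"
    # Text block: keep real words, drop words that are nothing but blanks.
    spans = " ".join(
        f"<span class='ocrx_word'>{escape(word)}</span>"
        for word in words
        if word.strip()
    )
    return f"<div class='ocr_carea' title='{title}'>{spans}</div>"
```

In a loop over blocks, the early return for image blocks plays the role that a continue statement would play in the renderer, which is one way to structure the skip without a goto.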