Troublesome Documents #1233

BillyCroan · 2024-01-14T23:43:19Z

BillyCroan
Jan 14, 2024

Is there a repository for 'troublesome documents'? I often find that North American gas pump receipts are recognized poorly. It's something about the dot in the zero that throws it off. They're in a somewhat rough-looking monospace font. I'd be happy to share an example file. And I was thinking, maybe there's a repository of wayward scans for devs to use in testing.

jbarlow83 · 2024-01-15T00:47:27Z

jbarlow83
Jan 15, 2024
Maintainer

That sounds like a pure OCR problem, in which case you'll want to report it with Tesseract or try the ocrmypdf-easyocr plugin to see if it does any better. I look problematic at incoming PDFs that need OCR.

2 replies

BillyCroan Jan 15, 2024
Author

You look problematic at?

The 0 in text like this often gets interpreted as an 8 or a 6 or a B.

Is this the pure OCR problem to which you refer?

I don't see easyocr in Debian, nor any packages containing jaided. Sounds like ocrmypdf can use either the tesseract or easyocr engine? I did try several engines and none recognized that text. It's not like a captcha though. I've even seen ocrmypdf recognize handwriting a couple of times.

jbarlow83 Jan 15, 2024
Maintainer

Yes, that's a pure OCR problem - the OCR engine isn't interpreting it correctly, and OCRmyPDF just puts that incorrect text into a PDF. At least part of the issue may be resolution - the DPI will need to be set quite low, because OCR likely isn't trained to find large fonts with heavily pixelated edges.

Tesseract could likely be fine-tuned to learn to read the 0 in this font. There are instructions in its documentation.

EasyOCR plugin: https://github.com/ocrmypdf/OCRmyPDF-EasyOCR
EasyOCR itself is just on GitHub and PyPI.

BillyCroan · 2024-01-16T02:44:16Z

BillyCroan
Jan 16, 2024
Author

I scan to 300dpi pnm files, then use IM convert to trim off the blackspace around the paper document, saving to a jpeg. Then I use img2pdf to concatenate the front and back pages into one pdf and pipe that to ocrmypdf. (A great idea I stole from your wiki :-)

Should I be telling ocrmypdf in that pipe that the dpi is 300? I think DPI "loses meaning" when the input is a pdf. Something like: pdfs don't have dpis, but their image content might, even if the whole purpose of a specific pdf is to contain full-page images....

Actually I just tried adding --image-dpi to my ocrmypdf command and it told me:
Argument --image-dpi is being ignored because the input file is a PDF, not an image.

Should I skip img2pdf then and process each image through ocrmypdf separately and find a way to concatenate those processed pages together later? Or is there a better way?

Thanks for the tip on EasyOCR too. I was looking for a better engine. EasyOCR detects it fine, though a bit slower, but what do you want for AI amiright? I might still prefer to use the standard tesseract method if setting a dpi is all that's needed.

I can't tell from your message if you suspect my scan dpi is too fine or too coarse. I don't want to lose resolution in the final pdfs I keep, but is there a way to drop their resolution (if that's what you're suggesting I do) only during ocrmypdf, for what is sent to tesseract? (like --clean)

I see " --oversample " in the manpage. Do I need an "--undersample"?

1 reply

jbarlow83 Jan 16, 2024
Maintainer

For a PDF, the DPI is calculated from the pixel dimensions of the image divided by the dimensions of the area over which the image is drawn. E.g. if you draw a 2550x3300 pixel image over 8.5x11", the DPI is 2550/8.5=300 horizontally and 3300/11=300 vertically.

When you use ocrmypdf or img2pdf to convert an image to PDF, both look at the image's metadata to see if a DPI is recorded there. If not, fallback options like --image-dpi are used. PNM files don't have metadata associated, so it may be that img2pdf is guessing the image resolution incorrectly and then ocrmypdf accepts its determination. I think that's what happened - img2pdf picks an incorrect default resolution, so tesseract gets incorrect information.
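
Editor's sketch of the arithmetic described above, in Python. The function names are hypothetical and the 96 DPI figure is only an illustrative wrong guess (an assumption, not something stated in this reply); the point is that the same pixels drawn at a lower assumed resolution produce a much larger page, which is what the OCR engine then sees.

```python
from typing import Tuple


def effective_dpi(width_px: int, height_px: int,
                  width_in: float, height_in: float) -> Tuple[float, float]:
    """DPI of an image in a PDF: pixel dimensions divided by drawn size in inches."""
    return (width_px / width_in, height_px / height_in)


def page_size_inches(width_px: int, height_px: int, dpi: float) -> Tuple[float, float]:
    """Page size that results when pixels are laid out at an assumed DPI."""
    return (width_px / dpi, height_px / dpi)


# The example from the reply: a 2550x3300 image drawn over 8.5x11 inches.
print(effective_dpi(2550, 3300, 8.5, 11))   # (300.0, 300.0)

# Same pixels, but laid out at an assumed (illustrative) 96 DPI.
print(page_size_inches(2550, 3300, 96))     # (26.5625, 34.375)
```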

BillyCroan · 2024-01-16T03:13:43Z

BillyCroan
Jan 16, 2024
Author

ahh. I just assumed pnm was a higher "quality" image. like raw files from a camera. I'm scanning with scanimage

scanimage: scanning image of size 2548x4199 pixels at 24 bits/pixel

I think it can save directly to jpg. I originally also wanted pnms to have a high quality original I could downconvert to a large number of combinations of color spaces and formats to see what got me the best bang for my filespace buck. RGB jpg won

I just retried scanning to jpg and skipping the convert from pnm to jpg. Only doing the "convert" to trim blank space off. Maybe the dpi is getting lost when I trim them, or during img2pdf, but it's still bad, a little worse actually.

I don't see where the original jpg file from the scanimage process has a dpi:

identify raw/20240115_210132front.jpg
raw/20240115_210132front.jpg JPEG 2548x4199 2548x4199+0+0 8-bit sRGB 289138B 0.000u 0:00.000

scanimage manpage doesn't seem to mention DPIs.... not a good sign
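
Editor's sketch: one way to check whether a JPEG records a resolution at all is to read the density fields of its JFIF APP0 header. The Python below uses only the standard library; the function name `jfif_dpi` is hypothetical, and it deliberately ignores other places a resolution can live (for example EXIF), so a `None` result only means "no usable JFIF density", not "no DPI anywhere".

```python
import struct
from typing import Optional, Tuple


def jfif_dpi(data: bytes) -> Optional[Tuple[float, float]]:
    """Return (x_dpi, y_dpi) from a JFIF APP0 segment, or None if not recorded."""
    if data[:2] != b"\xff\xd8":            # a JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:              # not at a marker: give up (sketch only)
            return None
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE0 and payload[:5] == b"JFIF\x00" and len(payload) >= 12:
            units, x_density, y_density = struct.unpack(">BHH", payload[7:12])
            if units == 1:                 # dots per inch
                return (float(x_density), float(y_density))
            if units == 2:                 # dots per centimetre
                return (x_density * 2.54, y_density * 2.54)
            return None                    # units == 0: aspect ratio only, no DPI
        if marker == 0xDA:                 # start of scan: header segments are over
            return None
        pos += 2 + length
    return None
```

Usage would look like `jfif_dpi(open("raw/20240115_210132front.jpg", "rb").read())`. If that returns `None`, downstream tools have to fall back on a default or on an explicitly supplied resolution.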

1 reply

jbarlow83 Jan 16, 2024
Maintainer

img2pdf has options to set DPI when creating the PDF

BillyCroan · 2024-01-16T15:41:19Z

BillyCroan
Jan 16, 2024
Author

Adding -s 300dpix300dpi to img2pdf seems to have improved things a little. I'd love if there was a note in your wiki to specify -s if your images don't have dpi or dimension information.
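
Editor's sketch of the pipeline described here, driven from Python. It assumes `img2pdf` and `ocrmypdf` are on the PATH, that img2pdf writes the PDF to stdout when no output file is given, and that ocrmypdf reads its input from stdin when the input argument is `-`. The helper names and file names are hypothetical; only the `-s 300dpix300dpi` option comes from this post.

```python
import subprocess
from typing import List, Sequence


def img2pdf_cmd(images: Sequence[str], dpi: int = 300) -> List[str]:
    """Build the img2pdf command, forcing the image resolution with -s."""
    return ["img2pdf", "-s", f"{dpi}dpix{dpi}dpi", *images]


def ocrmypdf_cmd(output_pdf: str) -> List[str]:
    """Build the ocrmypdf command; '-' means read the input PDF from stdin."""
    return ["ocrmypdf", "-", output_pdf]


def run_pipeline(images: Sequence[str], output_pdf: str, dpi: int = 300) -> None:
    """Equivalent of: img2pdf -s 300dpix300dpi front.jpg back.jpg | ocrmypdf - out.pdf"""
    producer = subprocess.Popen(img2pdf_cmd(images, dpi), stdout=subprocess.PIPE)
    consumer = subprocess.run(ocrmypdf_cmd(output_pdf), stdin=producer.stdout)
    producer.stdout.close()
    if producer.wait() != 0 or consumer.returncode != 0:
        raise RuntimeError("img2pdf or ocrmypdf exited with an error")


# Example call (not executed here):
# run_pipeline(["front.jpg", "back.jpg"], "receipt.pdf")
```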

I checked and all my old pdfs had way too large dimensions "7.10 × 19.42 inch" for a receipt. Most of that paper is gone now too :-(

with -s 300dpix300dpi in my pipeline, pdf properties now show 2.29 × 6.18 inch which is much more likely. I am going to assume that knowing the xy dimensions in pixels as well as the xy dimensions in inches is equivalent to knowing just one of those and the DPI.
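
Editor's sketch checking that assumption with the numbers in this post. Pixel count equals inches times DPI, so the page size at a known 300 DPI gives the pixel dimensions, and dividing those by the old page size gives the resolution the old PDFs were effectively built with. The function name is hypothetical, and the reported sizes are rounded, so the results are approximate.

```python
def implied_dpi(size_at_known_dpi_in: float, known_dpi: float,
                other_size_in: float) -> float:
    """DPI implied by a second page size, given the size at a known DPI."""
    pixels = size_at_known_dpi_in * known_dpi   # inches * DPI = pixels
    return pixels / other_size_in               # pixels / inches = DPI


# Width: 2.29 in at 300 DPI versus 7.10 in in the old PDFs.
print(round(implied_dpi(2.29, 300, 7.10), 1))    # roughly 96.8
# Height: 6.18 in at 300 DPI versus 19.42 in in the old PDFs.
print(round(implied_dpi(6.18, 300, 19.42), 1))   # roughly 95.5
```

Both values land near 96, which suggests the old PDFs were laid out at about 96 DPI instead of 300. That is an inference from the rounded sizes above, not something the thread confirms.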

The character recognition is still bad though (it did not noticeably improve at all). And while I know it's kind of an ugly font, it's a very mechanized font, like MICR codes: fixed width, a dot inside the zeros and not inside the O's. I'd almost expect this to be more reliable to OCR than Times New Roman or Helvetica or something with variable width. I would think that mechanized fonts (don't know if there's a better name for fixed-width fonts that take care to make every character very distinct) would be some of the easiest to OCR, but I'm probably going to use this EasyOCR AI software to get it done.

0 replies

BillyCroan · 2024-01-17T04:07:14Z

BillyCroan
Jan 17, 2024
Author

I installed the easyocr version via pipx and compared a bunch of files between the two versions. I found that while easyocr is more accurate at getting the letters right, the sidecar is all one line.

If I pdftotext the pdf it comes out on multiple lines, but the sidecar is jacked. Should I file that as a bug on the easyocr fork?

1 reply

jbarlow83 Jan 17, 2024
Maintainer

Yes that sounds like a bug
