Troublesome Documents #1233
-
That sounds like a pure OCR problem, in which case you'll want to report it to Tesseract or try the ocrmypdf-easyocr plugin to see if it does any better. I look at problematic incoming PDFs that need OCR.
-
I scan to 300 dpi pnm files, then use ImageMagick convert to trim off the black space around the paper document, saving to a jpeg. Then I use img2pdf to concatenate the front and back pages into one pdf and pipe that to ocrmypdf. (A great idea I stole from your wiki :-)

Should I be telling ocrmypdf in that pipe that the dpi is 300? I think DPI "loses meaning" when the input is a pdf. Something like: pdfs don't have a dpi, but their image content might, even if the whole purpose of a specific pdf is to contain full-page images.... Actually, I just tried adding --image-dpi to my ocrmypdf command and it told me:

Should I skip img2pdf then and process each image through ocrmypdf separately and find a way to concatenate those processed pages together later? Or is there a better way?

Thanks for the tip on EasyOCR too. I was looking for a better engine. EasyOCR detects it fine, though a bit slower, but what do you want for AI, amiright? I might still prefer to use the standard tesseract method if setting a dpi is all that's needed.

I can't tell from your message if you suspect my scan dpi is too fine or too coarse. I don't want to lose resolution in the final pdfs I keep, but is there a way to drop their resolution (if that's what you're suggesting I do) only during ocrmypdf, for what is sent to tesseract (like --clean)? I see " --oversample " in the manpage. Do I need an "--undersample"?
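For reference, here is a rough Python sketch of the img2pdf-to-ocrmypdf pipe described above. It is only an illustration: the function names and the file names in the usage note are made up, and the optional `-s` value uses the `300dpix300dpi` form reported further down in this thread. It relies on img2pdf writing the PDF to standard output when no output file is given, and on ocrmypdf accepting `-` as "read from standard input".

```python
import shlex
import subprocess


def build_pipeline(images, output_pdf, dpi=None):
    """Build the two argument lists for `img2pdf ... | ocrmypdf - out.pdf`.

    `images`, `output_pdf` and `dpi` are illustrative parameters, not part
    of either tool's API.
    """
    img2pdf_cmd = ["img2pdf"]
    if dpi is not None:
        # Tell img2pdf the scan resolution when the images carry none.
        img2pdf_cmd += ["-s", f"{dpi}dpix{dpi}dpi"]
    img2pdf_cmd += list(images)
    # "-" makes ocrmypdf read the PDF from standard input.
    ocr_cmd = ["ocrmypdf", "-", output_pdf]
    return img2pdf_cmd, ocr_cmd


def as_shell(img2pdf_cmd, ocr_cmd):
    """Render the pipeline as a single shell command line, for inspection."""
    return f"{shlex.join(img2pdf_cmd)} | {shlex.join(ocr_cmd)}"


def run_pipeline(img2pdf_cmd, ocr_cmd):
    """Run both tools, connecting img2pdf's stdout to ocrmypdf's stdin."""
    producer = subprocess.Popen(img2pdf_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.run(ocr_cmd, stdin=producer.stdout)
    producer.stdout.close()
    producer.wait()
    return consumer.returncode
```

Calling `as_shell(*build_pipeline(["front.jpg", "back.jpg"], "receipt.pdf", dpi=300))` shows the command line without running anything; `run_pipeline` needs both tools installed.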
-
Ahh. I just assumed pnm was a higher "quality" image, like raw files from a camera. I'm scanning with scanimage.
I think it can save directly to jpg. I originally also wanted pnms so I'd have a high-quality original I could downconvert to a large number of combinations of color spaces and formats, to see what got me the best bang for my filespace buck. RGB jpg won.

I just retried scanning to jpg and skipping the convert from pnm to jpg, only doing the "convert" to trim blank space off. Maybe the dpi is getting lost when I trim them, or during img2pdf, but it's still bad, a little worse actually. I don't see where the original jpg file from the scanimage process has a dpi:

scanimage manpage doesn't seem to mention DPIs.... not a good sign
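One way to check whether a JPEG carries any density information is to read its JFIF header directly. The sketch below uses only the standard library; the function name `jfif_density` is made up for this example. It only looks at a JFIF APP0 segment placed right after the start-of-image marker, so a file that stores its resolution elsewhere (for example in EXIF) would come back as `None`.

```python
import struct

# Meaning of the JFIF "units" byte.
_UNITS = {0: "aspect-ratio only", 1: "dpi", 2: "dots per cm"}


def jfif_density(data: bytes):
    """Return (units, x_density, y_density) from a JFIF header, or None.

    `data` is the first bytes of a JPEG file (18 bytes are enough).
    """
    # SOI marker (FF D8) immediately followed by the APP0 marker (FF E0).
    if len(data) < 18 or data[0:4] != b"\xff\xd8\xff\xe0":
        return None
    # Bytes 4-5 hold the segment length; the identifier follows it.
    if data[6:11] != b"JFIF\x00":
        return None
    # Bytes 11-12 are the version; then units (1 byte) and X/Y density
    # (2 bytes each, big-endian).
    units, x_density, y_density = struct.unpack(">BHH", data[13:18])
    return _UNITS.get(units, "unknown"), x_density, y_density
```

To try it on a scan, read the start of the file with `open(path, "rb").read(32)` and pass the result in. A units value of "aspect-ratio only" means the file records a pixel shape but no physical resolution.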
-
Adding -s 300dpix300dpi to img2pdf seems to have improved things a little. I'd love it if there was a note in your wiki to specify -s if your images don't have dpi or dimension information.

I checked and all my old pdfs had way too large dimensions, "7.10 × 19.42 inch" for a receipt. Most of that paper is gone now too :-( With -s 300dpix300dpi in my pipeline, pdf properties now show 2.29 × 6.18 inch, which is much more likely. I am going to assume that knowing the xy dimensions in pixels as well as the xy dimensions in inches is equivalent to knowing just one of those and the DPI.

The character recognition is still bad though (it did not noticeably improve at all). And while I know it's kind of an ugly font, it's a very mechanized font, like MICR codes: fixed width, a dot inside the zeros and not inside the O's. I'd almost expect this to be more reliable to OCR than Times New Roman or Helvetica or something with variable width. I would think that mechanized fonts (I don't know if there's a better name for fixed-width fonts that take care to make every character very distinct) would be some of the easiest to OCR, but I'm probably going to use this EasyOCR AI software to get it done.
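The pixels/inches/DPI relationship assumed above can be sanity-checked with a few lines of Python. This is just arithmetic on the figures quoted in this post, and it assumes both page sizes describe the same scan (which the post does not actually state). The helper names are made up for the example.

```python
def pixels_from(inches: float, dpi: float) -> float:
    """Pixel count along one axis, given a physical size and a resolution."""
    return inches * dpi


def implied_dpi(pixels: float, inches: float) -> float:
    """Resolution a viewer must have assumed to report `inches` for `pixels`."""
    return pixels / inches


# Page sizes quoted in the post, in inches.
new_w, new_h = 2.29, 6.18    # with -s 300dpix300dpi
old_w, old_h = 7.10, 19.42   # older PDFs made without -s

# Recover the pixel dimensions from the size reported at 300 dpi ...
width_px = pixels_from(new_w, 300)
height_px = pixels_from(new_h, 300)

# ... then ask what DPI would turn those pixels into the old page size.
old_dpi_w = implied_dpi(width_px, old_w)
old_dpi_h = implied_dpi(height_px, old_h)

print(f"about {width_px:.0f} x {height_px:.0f} px")
print(f"old PDFs imply about {old_dpi_w:.1f} and {old_dpi_h:.1f} dpi")
```

Both implied values land close to 96 dpi, which would be consistent with a low fallback resolution being applied when the images carried no density information. That reading is an assumption, not something confirmed in this thread.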
-
I installed the easyocr version via pipx, compared a bunch of files between the two versions, and found that while easyocr is more accurate at getting the letters right, the sidecar is all one line. If I pdftotext the pdf it comes out on multiple lines, but the sidecar is jacked. Should I file that as a bug on the easyocr fork?
-
Is there a repository for 'troublesome documents'? I often find that North American gas pump receipts are recognized poorly. It's something about the dot in the zero that throws it off. They're a somewhat rough-looking monospace font. I'd be happy to share an example file. And I was thinking, maybe there's a repository of wayward scans for devs to use in testing.