Speckled Documents Create Pathological Case for Tesseract #431
Comments
Wow, this is really extreme!
Well, I tested it and it takes less than 5 minutes... Tesseract (the official command-line tool) does not accept PDF as input, so how did you convert the PDF to a format that Tesseract accepts? Here is what I did:
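The exact command wasn't quoted in this thread; a Ghostscript invocation along these lines (the resolution flag in particular is a guess) produces one grayscale PNG per page:

```
# Render each page of the PDF to its own grayscale PNG
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=pnggray -r150 \
   -o gov-%d.png gov.uscourts.ctd.18812.88.0.pdf
```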
This creates five 'gov-n.png' images. OCR time for the first page: 1 minute and 5 seconds.
One minute per page is not extraordinarily long (although improvements which make it faster are of course welcome). My worst cases are currently double pages from a historic newspaper, which take around ten minutes.
Thanks for looking at this! We converted using Ghostscript to a multi-page TIFF:
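The command itself didn't survive here either; given the -sDEVICE=tiffgray and 300x300 dpi discussed later in the thread, it was presumably something along these lines:

```
# Render the whole PDF to a single multi-page grayscale TIFF at 300 dpi
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300x300 \
   -o gov2.tiff gov.uscourts.ctd.18812.88.0.pdf
```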
One minute/page is still pretty darned slow, but we'd welcome that at this point!
You could use gs to split the PDF into images and then OCR each page separately.
Sure, but that's not the point... and anyway, it's not at all clear that the slowness is because it's a multi-page TIFF. I suspect if you ran this on each individual page of the TIFF you'd have the same slowness.
To get accurate results, you will need to preprocess the images too, to remove the speckles. You could try scantailor or imagemagick. As a test, you can also try the VietOCR GUI and compare its results with the command-line output.
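For instance, a minimal ImageMagick cleanup pass might look like this (a sketch only; scantailor gives much finer control, and the threshold value would need tuning):

```
# Grayscale, despeckle, then binarize a rendered page before OCR
convert gov-1.png -colorspace Gray -despeckle -despeckle -threshold 60% gov-1-clean.png
tesseract gov-1-clean.png gov-1-clean
```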
Your command creates a 730 MB TIFF file, while my command creates five PNG files of 200-300 kB each.
Yeah, we saw this in testing, but went with TIFFs because they support multi-page images, which makes our OCR pipeline easier. In testing, we saw that the OCR for PDFs was no slower using large TIFFs than it was using PNGs, because the process seems to be CPU bound no matter what. If you use 300dpi PNGs, do you get the slow performance I experienced with the 300dpi TIFFs? That's probably a better test, right?
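The comparison could be as simple as timing Tesseract on both renderings of the same pages (filenames here are just illustrative):

```
# Same content, different containers: multi-page TIFF vs. a single-page PNG
time tesseract gov2.tiff out_tiff
time tesseract gov-1.png out_png
```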
This command creates a 42 MB TIFF file. The size in pixels of each page is the same as with my PNGs. It takes Tesseract 4 minutes and 29 seconds to read this TIFF.
@Shreeshrii commented:
I'm guessing that it will run faster too. BTW, here is what Tesseract outputs in the console:
It 'thinks' the speckles are diacritics...
Thanks for looking at this @amitdo.
But these aren't 300x300, which is apparently what provides the best OCR quality.[1] The point of this issue is that at 300x300, this takes seven hours to do five pages.
Yeah... that's an issue too. Running a despeckling filter first would help in this case, but we do OCR on millions of PDFs and we only need to despeckle the worst of them. For the rest, I imagine it would reduce quality (not to mention slow down the pipeline). The point here is that Tesseract takes seven hours for a speckled document at the recommended DPI.

[1]: Some references:
This PDF file is just a bag of images. This is very common and was probably produced by a scanner.

I just want to put this here because there are several different references being cited in this thread:

gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -o gov2.tiff gov.pdf

That said, best practice for a known 'bag of images' PDF is not to render anything at all; it is to extract the embedded images directly. That doesn't, however, address the core question about the dots, which seems like a legitimate issue.
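For illustration, extraction with pdfimages from poppler-utils might look like this (my sketch, assuming a poppler build with TIFF output support; not part of the original comment):

```
# Pull the embedded page images out of the PDF without re-rendering them
pdfimages -tiff gov.uscourts.ctd.18812.88.0.pdf gov
# OCR each extracted page (gov-000.tif, gov-001.tif, ...)
for f in gov-*.tif; do
  tesseract "$f" "${f%.tif}"
done
```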
@mlissner It would have been helpful if you had shared the info about your previous tests for this type of document: http://stackoverflow.com/questions/39110300/how-to-provide-image-to-tesseract-from-memory
We have seen similar documents taking very long (but still not an hour per page!), so whether it really is a tesseract issue should be investigated further. @mlissner in order to increase performance and quality, you have to pre-process the image(s) before handing them to tesseract; for your specific case, count the connected components first to spot the heavily speckled pages, and despeckle those. Look at how tesseract itself uses leptonica and connected components internally.
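A rough sketch of that connected-component check, using ImageMagick rather than Leptonica (the threshold is made up, and the verbose output format varies between ImageMagick versions):

```
# Binarize the page and list connected components; a heavily speckled page
# shows a huge number of tiny components compared to a clean one
convert gov-1.png -threshold 60% \
  -define connected-components:verbose=true \
  -connected-components 8 null: | wc -l
```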
This command will upscale the original images. It will make them more than 4 times larger. This is unnecessary because the DPI of the original images inside the PDF is 300x300, although the PDF itself falsely 'claims' that the DPI for these images is 72x72.
I was just encouraging -sDEVICE=tiffg4 over -sDEVICE=tiffgray for known black-and-white images. You are right, care should be taken to avoid rescaling, and that's the primary reason image extraction is safer than rendering.
@mlissner
Lots of responses here, so let me try to respond to as many as I can. I considered the various conversion approaches suggested above, and some of them look worth trying.

@jbreiden, you also suggest honoring the resolution settings that the PDF itself reports for its embedded images. This feels wrong to me. In my experience, PDFs are a terrible source of ground truth. I'd expect the header information in the images to be much more accurate than whatever a PDF was reporting. You've provided a lot of information here already, but can you explain why we'd prefer the PDF data over the image data?

@vidiecan: I'll look into counting connected components. It seems like a great way to solve this, if it performs well enough. Thanks for this suggestion.

@Shreeshrii: I looked at PDF Sandwich, but didn't see anything useful. Do you know the code well enough to point me towards the image conversion part?
In this buggy broken world, do whatever it takes to get the resolution right. I rescind my recommendation to honor the PDF settings. If you crack open gov.uscourts.ctd.18812.88.0.pdf, you can see that it really does contain black and white images. The telltale is BitsPerComponent 1 and the internal use of CCITTFaxDecode, which only works on black and white.
http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter

The embedded black-and-white image inside the PDF is already dithered. Ghostscript is innocent. Normally I prefer to feed Tesseract images that have been messed with as little as possible, but this may just be the exception. Tesseract is not trained on dithered text. Good luck with this one!
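One quick way to check the encoding and bit depth of the embedded images from the command line (my suggestion, assuming poppler-utils is installed):

```
# Lists every embedded image; the speckled pages show up as 1-bit CCITT (fax) images
pdfimages -list gov.uscourts.ctd.18812.88.0.pdf
```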
If you choose to use morphology to remove the dots and undo the dither, Leptonica is a very strong library for C or C++ programmers. A few morphology operations (erosions and dilations) hopefully would do the trick.
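In ImageMagick terms, a rough shell equivalent of the morphology the comment describes (the kernel size would need tuning for this document):

```
# A small morphological close removes isolated dark specks;
# too large a kernel starts eating thin letter strokes
convert gov-1.png -morphology Close Diamond:1 gov-1-denoised.png
```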
@jbreiden, do you know the image formats supported by Tesseract?
Leptonica is responsible for decoding image file formats. The list of supported formats is here. Discard PDF (IFF_LPDF) and PS (IFF_PS) because they are write-only, and discard SPIX because it is Leptonica specific. This support assumes that Leptonica is built with all imaging dependencies, which are optional. If you are running the Tesseract that ships on Linux distributions such as Debian or Ubuntu, there should be no problems. You might have less support on cygwin or similar, depending on how Leptonica was built.

https://github.com/DanBloomberg/leptonica/blob/master/src/imageio.h#L92
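As a quick way to see what a particular build supports, recent Tesseract builds print the Leptonica version and the image libraries it was compiled against (exact output varies by build):

```
# Reports the Tesseract and Leptonica versions and, on recent builds,
# the imaging libraries (libpng, libjpeg, libtiff, ...) Leptonica was built with
tesseract --version
```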
.--. .-. --- -..- .. -- .- / -.-. . -. - .- ..- .-. .. / -...
I've made a few in-place edits on the bug to clarify the wording. Hopefully it makes more sense now.
... . -. -.. / ... .--. .- -.-. . / -- .- .-. .. -. . ...
I deleted my previous message just after you made the edits. I thought that you didn't like my little joke... For the benefit of humankind, here it is again...
Jeff, we are coming, stay calm! LOL
It's good to know Morse code, or maybe just to find an online Morse code translator... :)
Even if you use
I just did some simple timings on this. The good:
The bad:
The hmmm:
Our bottleneck on our OCR server is CPU, so it's actually preferable for us to generate big files that use less CPU than to make small files that use more. OTOH, RAM is expensive, so we'll probably be switching this out. Thanks for the suggestion!
For generating many one-page image files instead of one multi-page TIFF file, use a tool or output pattern that writes one file per page.
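For example, with pdftoppm from poppler-utils (my suggestion; the tool the comment had in mind wasn't captured here):

```
# Writes gov-1.png, gov-2.png, ... one grayscale page image per file at 300 dpi
pdftoppm -r 300 -gray -png gov.uscourts.ctd.18812.88.0.pdf gov
```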
My first patch (dated March 28) in bug #233 will reduce RAM use with TIFF input. It stops Tesseract from buffering the input file before decompression. The patch should also make the LZW case equal to the non-LZW case with respect to RAM. Note that I haven't tested on this particular example, so I'm saying "should" rather than "does".
Jeff, will that patch be committed?
The plan was for Ray to commit that patch. However, he has been too busy with the upcoming Tesseract 4.0 and over six months have passed. I think it is okay if someone wants to commit the patch. Please do not commit the second patch, though; that should wait until after the next Leptonica release.
Jeff's patch was applied, so closing this issue. If the issue still exists in the current code, please create a new issue.
Tesseract just spent seven hours trying to do OCR on the attached document. It's five pages long.
I'm fairly certain that the reason this takes so long is because of the speckling in the document. Other times when I've seen this kind of performance, it's been for similarly speckled documents.
Not sure what you can or should do about it, but since it seems to be a worst case scenario for Tesseract, I thought I'd report it.
This is on the latest version of Tesseract.
gov.uscourts.ctd.18812.88.0.pdf