
Speckled Documents Create Psychological Case for Tesseract #431

Closed

mlissner opened this issue Sep 22, 2016 · 35 comments

@mlissner

Tesseract just spent seven hours trying to do OCR on the attached document. It's five pages long.

I'm fairly certain this takes so long because of the speckling in the document. Other times I've seen this kind of performance, it has been with similarly speckled documents.

Not sure what you can or should do about it, but since it seems to be a worst case scenario for Tesseract, I thought I'd report it.

This is on the latest version of Tesseract.

gov.uscourts.ctd.18812.88.0.pdf

@amitdo
Collaborator

amitdo commented Sep 23, 2016

Tesseract just spent seven hours trying to do OCR on the attached document. It's five pages long.

Wow, this is really extreme!

@amitdo
Collaborator

amitdo commented Sep 23, 2016

Well, I tested it and it takes less than 5 minutes...

Tesseract (the official command line tool) does not accept pdf as input, so how did you convert the pdf to a format that Tesseract accepts?

Here is what I did:

convert gov.uscourts.ctd.18812.88.0.pdf gov.png

This command will create 5 'gov-n.png' images.

First page:

tesseract gov-0.png gov-0

time: 1 minute and 5 seconds

@stweil
Member

stweil commented Sep 23, 2016

One minute per page is not extraordinarily long (although improvements that make it faster are of course welcome). My worst cases at the moment are double pages from a historic newspaper, which take around ten minutes each.

@mlissner
Author

mlissner commented Sep 23, 2016

Thanks for looking at this! We converted to a multi-page tiff using Ghostscript:

gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300x300 -o destination path

One minute/page is still pretty darned slow, but we'd welcome that at this point!

@Shreeshrii
Collaborator

You could use gs to split the pdf into images, then OCR each separately and concatenate the results.


@mlissner
Author

Sure, but that's not the point. And anyway, it's not at all clear that the slowness comes from it being a multi-page tiff; I suspect that if you ran this on each individual page of the tiff you'd see the same slowness.

@Shreeshrii
Collaborator

Shreeshrii commented Sep 23, 2016

To get accurate results, you will also need to preprocess the images to get rid of the background speckles.

You could try ScanTailor or ImageMagick.

As a test, you can also try the VietOCR GUI and compare its results with the command line output.


@amitdo
Collaborator

amitdo commented Sep 23, 2016

gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r300x300 -o gov2.tiff gov.pdf

Your command creates a 730 MB tiff file, while my command creates five png files of 200-300 kB each.

@mlissner
Author

Yeah, we saw this in testing, but went with TIFFs because they support multi-page images, which makes our OCR pipeline simpler. In those tests, OCR was no slower with the large TIFFs than with PNGs, because the process seems to be CPU bound either way.

If you use 300 dpi PNGs, do you get the slow performance I experienced with the 300 dpi TIFFs? That's probably a better test, right?

@amitdo
Collaborator

amitdo commented Sep 23, 2016

gov-0.png image properties:

Width: 35.417 in, Height: 45.834 in, at 72 x 72 DPI

This is equivalent to:

Width: 8.5 in, Height: 11 in, at 300 x 300 DPI
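
Spelled out (assuming the widths and heights above are in inches), the equivalence is just pixel arithmetic:

35.417 in x 72 dpi ≈ 2550 px = 8.5 in x 300 dpi
45.834 in x 72 dpi ≈ 3300 px = 11 in x 300 dpi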

@amitdo
Collaborator

amitdo commented Sep 23, 2016

gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -r72 -o gov.tiff gov.pdf
or
gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffgray -o gov.tiff gov.pdf

Either command creates a 42 MB tiff file. The size in pixels of each page is the same as with my PNGs.

It takes Tesseract 4 minutes and 29 seconds to read this tiff.

@amitdo
Collaborator

amitdo commented Sep 23, 2016

@Shreeshrii commented:

To get accurate results, you will need to preprocess the images too to get
rid of the background speckles.

I'm guessing that it will run faster too.

BTW, here is what Tesseract outputs in the console:

time tesseract gov.tiff gov
Tesseract Open Source OCR Engine v3.05.00dev with Leptonica
Page 1
Detected 1875 diacritics
Page 2
Detected 1338 diacritics
Page 3
Detected 1885 diacritics
Page 4
Detected 658 diacritics
Page 5
Detected 213 diacritics

real    4m29.118s
user    4m28.972s
sys 0m0.152s

It 'thinks' the speckles are diacritics...

@mlissner
Author

mlissner commented Sep 23, 2016

Thanks for looking at this @amitdo.

This command creates a 42 MB tiff file. The size in pixels of each page is the same as with my PNGs.

But these aren't 300x300, which is apparently what provides the best OCR quality.[1] The point of this issue is that at 300x300, this takes seven hours to do five pages.

It 'thinks' the speckles are diacritics...

Yeah...that's an issue too. Running a despeckling filter first would help in this case, but we do OCR on millions of PDFs and we only need to despeckle the worst of them. For the rest, I imagine it would reduce quality (not to mention slow down the pipeline).

The point here is that Tesseract takes seven hours for a speckled document at the recommended DPI.

[1]: Some references:

@jbreiden
Contributor

This PDF file is just a bag of images. This is very common and was probably produced by a photocopier or sheetfed scanner. Some fax machines make these too. It is entirely black and white. If you know you are working with black and white images, you can save a ton of space by using appropriate compression. This command renders 100% equivalent images for 2.3 MB.

gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -o gov2.tiff gov.pdf

That said, the best practice for a known 'bag of images' PDF is not to render anything. It is to extract the images, undisturbed. If necessary, adjust their header so that their resolution (e.g. 300 dpi) agrees with what the PDF claims. In an ideal world the two would always be consistent already, but programmers screw this up all the time. That's what you feed to Tesseract (assuming you don't want to do any additional cleaning). This workflow is somewhat sophisticated and maybe not easy for everyone, but it makes more sense than potentially rescaling the images by rendering at a different dpi. (A sketch of the header fix follows at the end of this comment.)

I just want to put this here because several different references about workflow are being cited in this bug report. Please consider this one authoritative.

It does not, however, address the core question about dots, which seems like a legitimate concern. This will be an interesting test document for future development.
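
For illustration, a minimal Leptonica sketch of that "fix the resolution header" step might look like the following. This is not part of jbreiden's workflow; the file names, the -llept link flag, and the choice of G4 output (which assumes a 1 bpp image like the CCITT ones in this PDF) are all assumptions.

/* fix_dpi.c -- stamp a known resolution into an extracted image's header.
   Build (assuming Leptonica and its headers are installed):
       gcc fix_dpi.c -o fix_dpi -llept */
#include <stdio.h>
#include <stdlib.h>
#include <leptonica/allheaders.h>

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s input.tif output.tif dpi\n", argv[0]);
        return 1;
    }

    PIX *pix = pixRead(argv[1]);              /* decode the extracted image */
    if (!pix) {
        fprintf(stderr, "could not read %s\n", argv[1]);
        return 1;
    }

    l_int32 dpi = atoi(argv[3]);
    pixSetResolution(pix, dpi, dpi);          /* overwrite the x/y resolution fields */
    pixWrite(argv[2], pix, IFF_TIFF_G4);      /* G4 assumes 1 bpp; use IFF_TIFF_LZW for grayscale */
    pixDestroy(&pix);
    return 0;
}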

@Shreeshrii
Collaborator

@mlissner It would have been helpful if you had shared the info about your previous tests for this type of document:

http://stackoverflow.com/questions/39110300/how-to-provide-image-to-tesseract-from-memory

https://github.com/mlissner/tesseract-performance-testing

@vidiecan

We have seen similar documents take very long (but still not an hour per page!), so whether this really is a Tesseract issue should be investigated further.

@mlissner, in order to increase performance and quality, you have to pre-process the image(s) for Tesseract. For your specific case, use Leptonica (Tesseract already depends on it): count the connected components, and if there are too many, apply your filters (see the sketch below). In a real-world application where your documents have specific characteristics, you will not be able to avoid heavy pre-processing for Tesseract if you want reasonable results.

Look at how Tesseract itself uses Leptonica and connected components, e.g.:
https://github.com/tesseract-ocr/tesseract/search?utf8=%E2%9C%93&q=pixConnComp
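
For illustration, a minimal sketch of that triage using the same Leptonica calls Tesseract links against. This is not vidiecan's actual code; the 10,000-component threshold is an arbitrary placeholder you would tune on your own corpus.

/* speckle_triage.c -- count connected components to decide whether a page
   needs a despeckle pass before OCR.
   Build: gcc speckle_triage.c -o speckle_triage -llept */
#include <stdio.h>
#include <leptonica/allheaders.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s page-image\n", argv[0]);
        return 1;
    }

    PIX *pixs = pixRead(argv[1]);
    if (!pixs) {
        fprintf(stderr, "could not read %s\n", argv[1]);
        return 1;
    }

    PIX *pixb = pixConvertTo1(pixs, 128);     /* binarize; pixConnComp wants 1 bpp */
    BOXA *boxa = pixConnComp(pixb, NULL, 8);  /* 8-connected components, bounding boxes only */
    l_int32 ncc = boxaGetCount(boxa);

    printf("%s: %d connected components\n", argv[1], ncc);
    if (ncc > 10000)                          /* placeholder threshold; tune per corpus */
        printf("heavily speckled: despeckle before OCR\n");

    boxaDestroy(&boxa);
    pixDestroy(&pixb);
    pixDestroy(&pixs);
    return 0;
}

Pages that trip the threshold could then be routed through a cleanup pass (for example, the morphology sketch further down) while everything else goes straight to Tesseract.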

@amitdo
Collaborator

amitdo commented Sep 24, 2016

@jbreiden

This command renders 100% equivalent images for 2.3MB.

gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiffg4 -r300x300 -o gov2.tiff gov.pdf

This command will upscale the original images, making them more than 4 times larger in each dimension. That is unnecessary, because the true DPI of the images inside the pdf is 300x300, even though the pdf itself falsely 'claims' that their DPI is 72x72.

@jbreiden
Contributor

I was just encouraging -sDEVICE=tiffg4 over -sDEVICE=tiffgray for known black and white images. You are right, care should be taken to avoid rescaling, and that's the primary reason image extraction is safer than rendering.

@Shreeshrii
Collaborator

@mlissner
You could also look at the preprocessing workflow used by pdfsandwich:
https://sourceforge.net/projects/pdfsandwich/


@mlissner
Author

mlissner commented Sep 26, 2016

Lots of responses here, so let me try to respond to as many as I can.

@amitdo and @jbreiden:

I considered using -sDEVICE=tiffg4 over -sDEVICE=tiffgray, but it's not purely black and white, and like I said, the bigger files don't seem to affect performance. Here's a comparison of a gray part of the original PDF:

tiffg4: [sample image]

tiffgray: [sample image]

tiffgray is definitely better for this, and since we're doing millions of files, it seems safer to use this approach than to assume all docs are purely black and white (even though it makes big files).

But setting that aside, it sounds like using gs is the wrong approach regardless, and the right approach is to extract the images undisturbed. That seems doable, but I'll have to do some research on it. Is it documented anywhere which image formats Tesseract supports natively? There's one question on Stack Overflow that seems to address this, but otherwise I don't see a lot of guidance. I'm concerned that if we use the undisturbed images, we'll get weird image formats that Tesseract won't accept.

@jbreiden you also say:

If necessary, adjust their header so that their resolution agrees with what the PDF was claiming.

This feels wrong to me. In my experience, PDFs are a terrible source of ground truth. I'd expect the header information in the images to be much more accurate than whatever a PDF was reporting. You've provided a lot of information here already, but can you explain why we'd prefer the PDF data over the image data?

@vidiecan: I'll look into counting connected components. Seems like a great way to solve this, if it performs well enough. Thanks for this suggestion.

@Shreeshrii: I looked at PDF Sandwich, but didn't see anything useful. Do you know the code well enough to point me towards the image conversion part?

@jbreiden
Contributor

jbreiden commented Sep 27, 2016

In this buggy broken world, do whatever it takes to get the resolution right. I rescind my recommendation to honor the PDF settings. If you crack open gov.uscourts.ctd.18812.88.0.pdf, you can see that it really does contain black and white images. The telltale is BitsPerComponent 1 and the internal use of CCITTFaxDecode, which only works on black and white.

<<
/Type /XObject
/Filter [/CCITTFaxDecode]
/Length 60 0 R
/Height 3300
/BitsPerComponent 1
/ColorSpace [/DeviceGray]
/DecodeParms [61 0 R]
/Subtype /Image
/Name /Im1
/Width 2550
>>

http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter

The embedded black and white image inside the PDF is already dithered. Ghostscript is innocent. Normally I prefer to feed Tesseract images that have been messed with as little as possible, but this may just be the exception. Tesseract is not trained on dithered text. Good luck with this one!


@jbreiden
Contributor

jbreiden commented Sep 27, 2016

If you choose to use morphology to remove the dots and undo the dither, Leptonica is a very strong library for C or C++ programmers. A few morphology operations (erosions and dilations) would hopefully do the trick.
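
Not a prescription, just a rough sketch of the kind of Leptonica morphology being suggested here; the 2x2 opening and 3x3 closing brick sizes are guesses and would need tuning against real pages.

/* despeckle.c -- rough morphological cleanup of a speckled page.
   Build: gcc despeckle.c -o despeckle -llept */
#include <stdio.h>
#include <leptonica/allheaders.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s input output.tif\n", argv[0]);
        return 1;
    }

    PIX *pixs = pixRead(argv[1]);
    if (!pixs) {
        fprintf(stderr, "could not read %s\n", argv[1]);
        return 1;
    }

    PIX *pixb = pixConvertTo1(pixs, 128);         /* binary morphology needs 1 bpp */

    /* Opening with a small brick erodes isolated specks and then restores
       the strokes that survive; closing fills small white holes that the
       dither left inside the glyphs. Brick sizes are guesses. */
    PIX *pix1 = pixOpenBrick(NULL, pixb, 2, 2);
    PIX *pix2 = pixCloseBrick(NULL, pix1, 3, 3);

    pixWrite(argv[2], pix2, IFF_TIFF_G4);

    pixDestroy(&pix2);
    pixDestroy(&pix1);
    pixDestroy(&pixb);
    pixDestroy(&pixs);
    return 0;
}

Note that a 2x2 opening will also eat periods and the dots on i's if they are small enough, which is exactly why the sizes need tuning rather than copying.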

@mlissner
Author

mlissner commented Sep 27, 2016

@jbreiden, do you know the image formats supported by Tesseract?

@jbreiden
Contributor

jbreiden commented Sep 27, 2016

Leptonica is responsible for decoding image file formats; the list of supported formats is linked below. Discard PDF (IFF_LPDF) and PS (IFF_PS) because they are write-only, and discard SPIX because it is Leptonica-specific. This support assumes that Leptonica is built with all of its imaging dependencies, which are optional. If you are running the Tesseract that ships with Linux distributions such as Debian or Ubuntu, there should be no problems. You might have less support on Cygwin or similar, depending on how Leptonica was built.

https://github.com/DanBloomberg/leptonica/blob/master/src/imageio.h#L92
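
As a small illustrative sketch (not from this thread), you can ask Leptonica how it classifies a file before handing it to Tesseract; the format codes come from the imageio.h linked above.

/* checkformat.c -- report how Leptonica classifies an input file.
   Build: gcc checkformat.c -o checkformat -llept */
#include <stdio.h>
#include <leptonica/allheaders.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s imagefile\n", argv[0]);
        return 1;
    }

    l_int32 format = IFF_UNKNOWN;
    if (findFileFormat(argv[1], &format) != 0) {
        fprintf(stderr, "could not determine the format of %s\n", argv[1]);
        return 1;
    }

    printf("Leptonica format code: %d\n", format);
    if (format == IFF_UNKNOWN || format == IFF_LPDF ||
        format == IFF_PS || format == IFF_SPIX)
        printf("not something to feed Tesseract directly\n");
    else
        printf("Leptonica can read this, so Tesseract should accept it\n");
    return 0;
}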

@jbreiden
Contributor

jbreiden commented Sep 27, 2016 via email

@amitdo
Collaborator

amitdo commented Sep 27, 2016

I've made a few in-place edits on the bug to clarify the wording. Hopefully makes more sense now.

I deleted my previous message just after you made the edits. I thought that you didn't like my little joke...
Clearly, I was wrong!

For the benefit of humankind, here it is again...

@jbreiden

Jeff, your last two messages look cryptic...

If you have been abducted by aliens, try to give us a sign and we will rescue you! :)


.--. .-. --- -..- .. -- .- / -.-. . -. - .- ..- .-. .. / -...

... . -. -.. / ... .--. .- -.-. . / -- .- .-. .. -. . ...

Jeff, we are coming, stay calm!

LOL

@amitdo
Collaborator

amitdo commented Sep 27, 2016

It's good to know Morse code, or maybe just to find an online Morse code translator... :)

@amitdo
Collaborator

amitdo commented Sep 29, 2016

Even if you use -sDEVICE=tiffgray, you might want to use -sCompression=lzw.

@mlissner
Author

mlissner commented Sep 29, 2016

You might want to use -sCompression=lzw.

I just did some simple timings on this.

The good:

  • Compressed Tiffs are about 1-2% the size of the uncompressed versions (in a test I just did, uncompressed was 137M while compressed was 1.8M!).
  • Using compressed files used about 30% of the RAM (92MB instead of 301MB according to time -v).
  • LZW compression is lossless, so Tesseract generated identical results.
  • It took Tesseract about the same amount of time to do either format.

The bad:

  • It takes about twice as long to generate compressed tiffs from PDFs (though this is only a fraction of the total time doing OCR).

The hmmm:

  • Making a compressed tiff moves the processing burden from disk (making a big file) to CPU (compressing a big file).

Our bottleneck on our OCR server is CPU, so it's actually preferable for us to generate big files that use less CPU than small files that use more. OTOH, RAM is expensive, so we'll probably be switching this out anyway. Thanks for the suggestion!

@amitdo
Collaborator

amitdo commented Sep 29, 2016

For generating many single-page image files instead of one multi-page tiff file, use -o img-%d.tiff.

@jbreiden
Contributor

jbreiden commented Oct 4, 2016

My first patch (dated March 28) in bug #233 will reduce RAM use with TIFF input. It stops Tesseract from buffering the input file before decompression. The patch should also make the LZW case equal to the non-LZW case with respect to RAM. Note that I haven't tested on this particular example, so I'm saying "should" rather than "does".

@Shreeshrii
Collaborator

Jeff, why are we not committing your patch from March?


@jbreiden
Contributor

jbreiden commented Oct 5, 2016

The plan was for Ray to commit that patch. However, he has been too busy with the upcoming Tesseract 4.0 and over six months have passed. I think it is okay if someone wants to commit the patch. Please do not commit the second patch, though; that should wait until after the next Leptonica release.

@zdenop
Contributor

zdenop commented Sep 27, 2018

Jeff's patch was applied, so I'm closing this issue. If the problem still exists in the current code, please open a new issue.
