In a multipage TIFF, results are returned only from the first page #76

Omnipresent · 2017-09-18T17:59:56Z

With a multipage TIFF file, pyocr returns results only for the first page. The same file, however, works from the tesseract command line.

Here is an example of a multipage TIFF file: https://www.dropbox.com/s/qh72ec84su9zsj6/multipage.tiff?dl=0

txt = tool.image_to_string(
    Image.open('multipage.tiff'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
print(txt)

shows

This is page one

jflesch · 2017-09-18T18:18:49Z

Hm, I'm not sure it can be considered a Pyocr problem. It looks to me more like a Pillow (PIL.Image) limitation.

Omnipresent · 2017-09-18T18:23:43Z

I believe the underlying problem is the Tesseract C API. I opened a bug about it, which was closed: tesseract-ocr/tesseract#1138

Although the command line returns all the text, the C API only returns the last page (the first page in the case of pyocr).

Does pyocr use TextRenderer to return the text results from the image, or does it use TessBaseAPIGetUTF8Text?

jflesch · 2017-09-18T18:27:21Z

When using Pyocr, the root problem is that the image has to be opened with Pillow (PIL.Image), and AFAIK, it doesn't support multipage TIFF files at all.
If it did, Pyocr or you could simply send the pages one by one to Tesseract (shell or libtesseract).

Omnipresent · 2017-09-18T18:30:14Z

PIL seems to support multipage TIFFs:

 >>> tiffstack = Image.open('multipage.tiff')
 >>> tiffstack.load()
 <PixelAccess object at 0x7fc76bf1dab0>
 >>> print(tiffstack.n_frames)
 2

For the one-by-one approach, are you suggesting that PIL reads the multipage TIFF, and then we loop over the pages and send each one to pyocr?

jflesch · 2017-09-18T18:30:17Z

Oh wait, nevermind, it does support it

Omnipresent · 2017-09-18T18:31:22Z

> Oh wait, nevermind, it does support it

Yeah, I think the problem is the way Pyocr is calling the Tesseract C API. Perhaps we need to call a particular method in order to return the entire text.

jflesch · 2017-09-18T18:31:59Z

Actually, since it does support it, calling Image.seek() can solve your problem easily

jflesch · 2017-09-18T18:33:14Z

txt = ""
img = Image.open('multipage.tiff')
for frame in range(img.n_frames):
    img.seek(frame)  # select the page to run OCR on
    txt += tool.image_to_string(
        img,
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
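
A variant of the loop above that keeps each page's text separate instead of concatenating it, as a minimal sketch. The helper name `ocr_all_pages` and the `ocr_page` callable are hypothetical, introduced only for illustration; `n_frames` and `seek()` are the Pillow attributes already used above.

```python
def ocr_all_pages(img, ocr_page):
    """Return a list with the OCR text of every page in a multipage image.

    img      -- an object exposing n_frames and seek(), such as a
                PIL.Image.Image opened from a multipage TIFF
    ocr_page -- hypothetical callable that takes the image (positioned on
                one page) and returns that page's text
    """
    pages = []
    # The getattr() default covers single-frame images lacking n_frames.
    for frame in range(getattr(img, "n_frames", 1)):
        img.seek(frame)  # select the page to recognise
        pages.append(ocr_page(img))
    return pages
```

With pyocr, `ocr_page` could be `lambda page: tool.image_to_string(page, lang=lang, builder=pyocr.builders.TextBuilder())`, and `"\n".join(ocr_all_pages(img, ocr_page))` would then give the full text.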

Omnipresent · 2017-09-18T18:38:31Z

OK, great! That seems to work. I will try to integrate it into my app. As a side note, how does get_available_tools() in pyocr detect Tesseract? In my application, for some reason I can't execute the tesseract command line, but I do have /usr/local/lib/libtesseract.so.3.0.4. Will that be OK for pyocr?

jflesch · 2017-09-18T18:45:27Z

> how does get_available_tools() in pyocr detect Tesseract?

For the command line tool, it looks on your PATH (see https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/tesseract.py#L379 and https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/util.py#L25 ).
For the library libtesseract, it tries to load one or two library names using the standard library loading mechanism (see https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/libtesseract/tesseract_raw.py#L39 and http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html ).

So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4.
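
As a rough illustration of that loading mechanism (not Pyocr's actual code), the sketch below tries a list of library names with ctypes and lets the dynamic linker resolve each one. The function name `load_first_available` and the candidate list are assumptions made for the example.

```python
import ctypes


def load_first_available(names):
    """Try each library name in turn; return (name, handle) or None.

    A bare name such as "libtesseract.so.3" is resolved through the
    dynamic linker's search path; a name containing a slash is treated
    as a file path.
    """
    for name in names:
        try:
            return name, ctypes.cdll.LoadLibrary(name)
        except OSError:
            continue  # not found under this name, try the next one
    return None


# The candidate names below are illustrative only.
found = load_first_available(["libtesseract.so.3", "libtesseract.so"])
print("loaded" if found else "libtesseract not found by the dynamic linker")
```

If this prints the "not found" message while the file exists on disk, the directory holding it is probably not on the dynamic linker's search path.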

jflesch · 2017-09-18T19:10:29Z

I'm going to close this issue. If you still have problems with multipage TIFF, don't hesitate to comment here again, and I'll reopen it.

Omnipresent · 2017-09-18T19:19:03Z

> So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4

I'm discovering that libtesseract isn't found via get_available_tools() for me. I have /usr/local/lib/libtesseract.so.3.0.4. Do I also need a symbolic link /usr/local/lib/libtesseract.so.3 --> /usr/local/lib/libtesseract.so.3.0.4?

Is there any way I can override this setting?

Omnipresent · 2017-09-18T19:23:36Z

My application is running on a PaaS, and it might not be feasible for me to create a symlink. Is there a way to avoid a symlink? I already have the needed libtesseract.so.3.0.4.

Omnipresent · 2017-09-18T20:18:41Z

Actually, I have the symlink, but it is not in /usr/local/lib:

vcap@63~$ ls -al /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3
lrwxrwxrwx 1 vcap vcap 21 Jan  7  2017 /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3 -> libtesseract.so.3.0.4

Is there a way to change its location in pyocr?

jflesch · 2017-09-18T20:27:37Z

Again, when looking for libraries, Pyocr doesn't look in specific locations. It lets the dynamic linker do the job. You may want to have a look at the environment variable defining where the dynamic linker looks for libraries: http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html

It does look for a specific library name however. Currently, it can't be changed without patching Pyocr.
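
One way to point the dynamic linker at a non-standard directory is to set `LD_LIBRARY_PATH` before the Python process starts. On Linux the linker typically reads this variable at process start-up, so changing `os.environ` inside the running interpreter is generally too late. A minimal sketch, assuming a Linux system; the helper name `env_with_library_dir` is hypothetical:

```python
import os


def env_with_library_dir(directory, env=None):
    """Return a copy of env with directory prepended to LD_LIBRARY_PATH.

    env defaults to the current process environment; the mapping passed
    in is never modified.
    """
    new_env = dict(os.environ if env is None else env)
    current = new_env.get("LD_LIBRARY_PATH", "")
    new_env["LD_LIBRARY_PATH"] = (
        directory + os.pathsep + current if current else directory
    )
    return new_env
```

The result can be passed as the `env` argument of `subprocess.run()` when launching the application, for example with the directory that actually contains `libtesseract.so.3`.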

Omnipresent · 2017-09-18T20:40:20Z

Yeah, I was loading tesseract like that too (which worked):

libname = '/home/vcap/app/.heroku/vendor/lib/libtesseract.so.3'
self.tesseract = cdll.LoadLibrary(libname)

So I guess my dynamic linker must not be working. I'll take a look.

jflesch added the support label Sep 18, 2017

jflesch closed this as completed Sep 18, 2017
