This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

In a multipage TIFF, results are returned only from the first page #76

Closed
Omnipresent opened this issue Sep 18, 2017 · 16 comments

@Omnipresent

With a multipage TIFF file, results are returned only for the first page. The same file works correctly from the tesseract command line.

Here is an example of a multipage TIFF file: https://www.dropbox.com/s/qh72ec84su9zsj6/multipage.tiff?dl=0

txt = tool.image_to_string(
    Image.open('multipage.tiff'),
    lang=lang,
    builder=pyocr.builders.TextBuilder()
)
print(txt)

shows

This is page one

@jflesch
Member

jflesch commented Sep 18, 2017

Hm, I'm not sure it can be considered a Pyocr problem. It looks to me more like a Pillow (PIL.Image) limitation.

@Omnipresent
Author

I believe the underlying problem is the Tesseract C API. I opened this bug, which was closed: tesseract-ocr/tesseract#1138

Although the command line returns all the text, the C API only returns the last page (the first page in the case of pyocr).

Does pyocr use TextRenderer to return the text results from the image, or does it use TessBaseAPIGetUTF8Text?

@jflesch
Member

jflesch commented Sep 18, 2017

When using Pyocr, the root problem is that the image has to be opened with Pillow (PIL.Image), and AFAIK, it doesn't support multipage TIFF files at all.
If it did, Pyocr or you could simply send the pages one by one to Tesseract (shell or libtesseract).

@Omnipresent
Author

Omnipresent commented Sep 18, 2017

PIL seems to support multipage tiffs

 >>> tiffstack = Image.open('multipage.tiff')
 >>> tiffstack.load()
 <PixelAccess object at 0x7fc76bf1dab0>
 >>> print(tiffstack.n_frames)
 2

For the one-by-one approach, are you suggesting that we open the multipage TIFF with PIL, loop over the pages, and send each page to pyocr?

@jflesch
Member

jflesch commented Sep 18, 2017

Oh wait, nevermind, it does support it

@Omnipresent
Author

Oh wait, nevermind, it does support it

Yeah, I think the problem is the way Pyocr calls the Tesseract C API. Perhaps we need to call a particular method in order to return the entire text.

@jflesch
Member

jflesch commented Sep 18, 2017

Actually, since it does support it, calling Image.seek() can solve your problem easily

@jflesch
Member

jflesch commented Sep 18, 2017

# `tool` and `lang` are set up as in the original report.
txt = ""
img = Image.open('multipage.tiff')
for frame in range(img.n_frames):
    img.seek(frame)  # select page number `frame` of the TIFF
    txt += tool.image_to_string(
        img,
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    )
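
The loop above can also be wrapped in a small helper that keeps the pages separate until the end. This is only a sketch: `ocr_all_pages` and `ocr_tiff` are hypothetical names that are not part of pyocr, and the OCR call is passed in as a callable so the looping logic does not depend on any particular backend. It assumes the image object exposes `n_frames` and `seek()`, as Pillow images do.

```python
def ocr_all_pages(img, ocr_page, separator="\n"):
    """Run ocr_page on every frame of a multi-frame image and join the text.

    img      -- any object with seek(); n_frames is optional (defaults to 1)
    ocr_page -- hypothetical callable taking the image and returning a string
    """
    n_frames = getattr(img, "n_frames", 1)
    pages = []
    for frame in range(n_frames):
        img.seek(frame)  # select the page in place
        pages.append(ocr_page(img))
    return separator.join(pages)


def ocr_tiff(path, lang="eng"):
    """Example wiring with Pillow and pyocr (requires both to be installed)."""
    from PIL import Image
    import pyocr
    import pyocr.builders

    tool = pyocr.get_available_tools()[0]  # assumes at least one tool is found
    with Image.open(path) as img:
        return ocr_all_pages(
            img,
            lambda page: tool.image_to_string(
                page, lang=lang, builder=pyocr.builders.TextBuilder()
            ),
        )
```

Joining with a separator avoids gluing the last word of one page to the first word of the next, which plain `txt +=` concatenation can do.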

@Omnipresent
Author

OK, great! That seems to work. I will try to integrate it into my app. As a side note, how does get_available_tools() in pyocr detect Tesseract? In my application, for some reason I can't execute the tesseract command line, but I do have /usr/local/lib/libtesseract.so.3.0.4. Will that be OK for pyocr?

@jflesch
Member

jflesch commented Sep 18, 2017

how does get_available_tools() in pyocr detect Tesseract?

For the command line tool, it looks on your PATH (see https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/tesseract.py#L379 and https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/util.py#L25 ).
For the library libtesseract, it tries to load one or two library names using the standard library loading mechanism (see https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/libtesseract/tesseract_raw.py#L39 and http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html ).

So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4.
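
To check what the dynamic linker can actually resolve, a quick diagnostic along these lines may help. It is a sketch rather than what Pyocr does internally: `try_load` and `CANDIDATES` are hypothetical names, and the candidate list simply reuses the soname and the path mentioned in this thread.

```python
import ctypes


def try_load(names):
    """Return (name, handle) for the first library that can be loaded."""
    for name in names:
        try:
            return name, ctypes.CDLL(name)
        except OSError:
            continue  # not found or not loadable; try the next candidate
    return None, None


# Illustrative candidates: a bare soname is resolved through the linker's
# search path, while an absolute path bypasses that search entirely.
CANDIDATES = [
    "libtesseract.so.3",
    "/usr/local/lib/libtesseract.so.3.0.4",
]
```

Calling `try_load(CANDIDATES)` and printing the returned name shows which form works. If only the absolute path loads, the library is present but the linker's search path does not include its directory.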

@jflesch
Member

jflesch commented Sep 18, 2017

I'm going to close this issue. If you still have problems with multipage TIFF, don't hesitate to comment here again, and I'll reopen it.

@jflesch jflesch closed this as completed Sep 18, 2017
@Omnipresent
Author

So in your case, libtesseract.so.3.0.4 should work fine as long as you also have a symbolic link libtesseract.so.3 --> libtesseract.so.3.0.4

I'm discovering that libtesseract isn't found via get_available_tools() for me. I have /usr/local/lib/libtesseract.so.3.0.4. Do I also need a symbolic link /usr/local/lib/libtesseract.so.3 --> /usr/local/lib/libtesseract.so.3.0.4?

Is there any way I can override this setting?

@Omnipresent
Author

My application is running on a PaaS, and it might not be feasible for me to create a symlink. Is there a way to avoid a symlink? I already have the needed libtesseract.so.3.0.4.

@Omnipresent
Author

Omnipresent commented Sep 18, 2017

Actually, I have the symlink, but it is not in /usr/local/lib:

vcap@63~$ ls -al /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3
lrwxrwxrwx 1 vcap vcap 21 Jan  7  2017 /home/vcap/app/.heroku/vendor/lib/libtesseract.so.3 -> libtesseract.so.3.0.4

Is there a way to change its location in pyocr?

@jflesch
Member

jflesch commented Sep 18, 2017

Again, when looking for libraries, Pyocr doesn't look in specific locations. It lets the dynamic linker do the job. You may want to have a look at the environment variable that defines where the dynamic linker looks for libraries: http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html

It does look for a specific library name however. Currently, it can't be changed without patching Pyocr.
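
On Linux the variable in question is usually LD_LIBRARY_PATH, and as far as I know the glibc dynamic linker reads it when the process starts, so changing `os.environ` inside an already running Python process generally has no effect on later library loads. Where the platform does not let you set it in the launch configuration, one possible workaround is to restart the script once with the variable set. The sketch below assumes that behaviour; `env_with_lib_dir` and `reexec_with_lib_dir` are hypothetical helper names, not part of Pyocr.

```python
import os
import sys


def env_with_lib_dir(env, lib_dir, var="LD_LIBRARY_PATH"):
    """Return a copy of env with lib_dir prepended to the search path."""
    new_env = dict(env)
    parts = [p for p in new_env.get(var, "").split(os.pathsep) if p]
    if lib_dir not in parts:
        parts.insert(0, lib_dir)
    new_env[var] = os.pathsep.join(parts)
    return new_env


def reexec_with_lib_dir(lib_dir):
    """Restart the current script with lib_dir on LD_LIBRARY_PATH."""
    new_env = env_with_lib_dir(os.environ, lib_dir)
    if new_env["LD_LIBRARY_PATH"] == os.environ.get("LD_LIBRARY_PATH"):
        return  # already in place; do not exec again
    # Replaces the current process; code after this call does not run.
    os.execve(sys.executable, [sys.executable] + sys.argv, new_env)
```

It would be called at the very top of the entry script, before importing pyocr, for example `reexec_with_lib_dir("/home/vcap/app/.heroku/vendor/lib")`. Setting the variable in the platform's own environment configuration, where that is possible, is simpler and avoids the restart.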

@Omnipresent
Author

Yeah, I was loading tesseract like that too (which worked):

libname = '/home/vcap/app/.heroku/vendor/lib/libtesseract.so.3'
self.tesseract = cdll.LoadLibrary(libname)

So I guess my dynamic linker must not be working. I'll take a look.
