In a multipage TIFF, results are returned only from the first page #76
Comments
Hm, not sure it can be considered a Pyocr problem. It looks to me more like a Pillow (PIL.Image) limitation. |
I believe the underlying problem is the Tesseract C API. I opened this bug, which was closed: tesseract-ocr/tesseract#1138. Although the command line returns all the text, the C API only returns the last page (the first page in the case of pyocr). Does pyocr use |
When using Pyocr, the root problem is that the image has to be opened with Pillow (PIL.Image), and AFAIK, it doesn't support multi-page TIFF files at all. |
PIL seems to support multipage TIFFs.
For the one-by-one example, are you suggesting that PIL reads a multipage TIFF, and then we loop over each page and send each page to pyocr? |
Oh wait, never mind, it does support it. |
Yeah, I think the problem is the way Pyocr is calling the Tesseract C API. We perhaps need to call a particular method in order to return the entire text. |
Actually, since it does support it, calling Image.seek() can solve your problem easily |
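from PIL import Image
import pyocr
import pyocr.builders

# `tool` and `lang` are assumed to be set up beforehand, e.g.
# tool = pyocr.get_available_tools()[0]
txt = ""
img = Image.open('multipage.tiff')
for frame in range(0, img.n_frames):
    # Select the page, then OCR it and append its text
    img.seek(frame)
    txt += tool.image_to_string(
        img,
        lang=lang,
        builder=pyocr.builders.TextBuilder()
    ) |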
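If you want the text of each page kept separate instead of concatenated into one string, the same loop can be wrapped in a small helper. This is only a sketch: `ocr_all_pages` and `ocr_page` are hypothetical names, not part of Pyocr, and the helper only assumes an object with `seek()` and (optionally) `n_frames`, as Pillow images provide.

```python
def ocr_all_pages(img, ocr_page):
    """Run ocr_page(img) once per frame and return a list of per-page results.

    img      -- any object with seek(frame) and, optionally, n_frames
                (a Pillow image opened from a multipage TIFF, for instance)
    ocr_page -- a callable taking the image positioned on the current frame
    """
    pages = []
    # Fall back to a single frame if the object has no n_frames attribute
    for frame in range(getattr(img, "n_frames", 1)):
        img.seek(frame)
        pages.append(ocr_page(img))
    return pages


# Possible usage with Pyocr (assumes `tool` and `lang` are already set up):
#
#   from PIL import Image
#   import pyocr.builders
#
#   img = Image.open("multipage.tiff")
#   pages = ocr_all_pages(
#       img,
#       lambda im: tool.image_to_string(
#           im, lang=lang, builder=pyocr.builders.TextBuilder()
#       ),
#   )
#   txt = "\n".join(pages)
```

Joining with a newline keeps the last word of one page from running into the first word of the next, which the plain `txt +=` version does not guard against.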
ok, great! that seems to work. I will try to integrate it into my app. As a side note, how does |
For the command line tool, it looks on your PATH (See https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/tesseract.py#L379 and https://github.com/openpaperwork/pyocr/blob/master/src/pyocr/util.py#L25 ). So in your case, |
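To check what a PATH lookup would find on your machine, something like the following can help. It is a rough sketch using only the standard library (`shutil.which`), not Pyocr's own lookup code, and `find_tool` is a hypothetical helper name.

```python
import shutil


def find_tool(name="tesseract"):
    """Return the full path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("tesseract was not found on PATH")
    else:
        print("tesseract found at", path)
```

If this prints that nothing was found, adding the directory that contains the binary to PATH before starting Python is likely the simplest fix.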
I'm going to close this issue. If you still have problems with multipage TIFF, don't hesitate to comment here again, and I'll reopen it. |
I'm discovering that libtesseract isn't found via Is there any way I can override this setting? |
My application is running on a PaaS and it might not be feasible for me to create a symlink. Is there a way to avoid a symlink? I already have the needed libtesseract.so.3.0.4 |
Actually I have the symlink but it is not in
Is there a way to change its location in pyocr? |
Again, when looking for libraries, Pyocr doesn't look in specific locations. It lets the dynamic linker do the job. You may want to have a look at the environment variable defining where the dynamic linker looks for libraries: http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html It does look for a specific library name, however. Currently, that name can't be changed without patching Pyocr. |
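One way to see whether the dynamic linker can resolve the library from your environment is to try loading it with `ctypes`. This is a minimal sketch, not Pyocr's code: `first_loadable` is a hypothetical helper, and the candidate names are assumptions, so check which file name your system and your Pyocr version actually expect.

```python
import ctypes


def first_loadable(names):
    """Return the first library name the dynamic linker can load, or None."""
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            # Not found (or not loadable) through the linker's search path
            continue
        return name
    return None


if __name__ == "__main__":
    # Assumed candidate names, for illustration only
    candidates = ["libtesseract.so.3", "libtesseract.so.4", "libtesseract.so.5"]
    print(first_loadable(candidates))
```

On Linux, the linker's search path is usually extended with `LD_LIBRARY_PATH`, and it generally has to be set before the Python process starts for it to take effect.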
Yeah, I was loading tesseract like that too (which worked):
So I guess my dynamic linker must not be working. I'll take a look. |
In a multipage TIFF file, the results are returned only for the first page. The same file, however, works from the tesseract command line.
Here is an example of a multipage TIFF file: https://www.dropbox.com/s/qh72ec84su9zsj6/multipage.tiff?dl=0
shows