TessBaseAPIProcessPages only processes information for the last page #1138

Omnipresent · 2017-09-17T20:26:19Z

Environment

Tesseract Version: tesseract 3.04.00
Commit Number:
Platform: Linux 0ba52aafd58a 4.9.4-moby #1 SMP Wed Jan 18 17:04:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

When processing a multi-page TIFF with the Tesseract API from Python, the text returned is only for the LAST page rather than for ALL pages.

Code used:

# Declare the C signature of TessBaseAPIProcessPages for ctypes.
self.tesseract.TessBaseAPIProcessPages.argtypes = [POINTER(TessBaseAPI), c_char_p, c_char_p, c_int, POINTER(TessResultRenderer)]
self.tesseract.TessBaseAPIProcessPages.restype = c_bool
# Process the whole file; no renderer is passed (last argument is None).
success = self.tesseract.TessBaseAPIProcessPages(self.api, create_string_buffer(path_to_multipage_tiff), None, 0, None)
# Read the recognized text back from the API handle.
ocr_r = self.tesseract.TessBaseAPIGetUTF8Text(self.api)
result = string_at(ocr_r)

Expected Behavior:

Instead of text returned only for the last page, the text should be returned for all pages.

Suggested Fix:

Interestingly, this works when using the tesseract command-line tool, so perhaps a fix already exists there.

amitdo · 2017-09-18T06:14:31Z

Please report this issue in the repo of the python binding.

zdenop · 2017-09-18T06:40:47Z

...and the latest tesseract stable version is 3.05.01

Omnipresent · 2017-09-18T10:33:55Z

@zdenop This happens in 3.05.01 as well. I'm wondering whether calling TessBaseAPIProcessPages is the correct way to get ALL text from multi-page PDFs, or whether there is another recommended way.

@amitdo I did not see a Python binding repo under tesseract. Can you please link me to the one you are referring to? Please note I'm using ctypes to call the actual Tesseract API and not a Python wrapper for Tesseract.

amitdo · 2017-09-18T10:51:36Z

> I did not see a Python binding repo under tesseract. Can you please link me to the one you are referring to? Please note I'm using ctypes to call the actual Tesseract API

I thought you were using a third-party Python binding.
https://github.com/tesseract-ocr/tesseract/wiki/AddOns#tesseract-30x

amitdo · 2017-09-18T10:57:09Z

> Interestingly this works when using the command line tesseract command. So perhaps there is already a fix for the command line

The command line uses the C++ API.

Omnipresent · 2017-09-18T10:57:37Z

@amitdo Sorry for the confusion, but I am not. I am referring to the actual Tesseract C API, which can be called from Python as shown in this example:

tesseract/contrib/tesseract-c_api-demo.py

Line 72 in a75ab45

text_out = tesseract.TessBaseAPIProcessPages(api, filename, None , 0);

Note that the linked example is a bit out of date. An example that actually works is https://stackoverflow.com/a/36876584/44286

However, the question remains: how do I call the C API to get the text from all pages of a multi-page TIFF, instead of only the last page?

zdenop · 2017-09-18T11:42:38Z

We do not provide support for third-party software (e.g. Python), so you need to be able to replicate the problem in C++ or C.
I created some examples of using the API from Python, but it is sometimes a bit tricky and has no official support...
Please use the tesseract user forum instead (there are more people there with more experience).

amitdo · 2017-09-18T11:52:03Z

text = tesseract.TessBaseAPIGetUTF8Text(api)

This method returns text for one page only. The command line tool does not directly call this method.

You'll have to look in api/tesseractmain.cpp and mimic it to get things right.

Anyway, it's not an issue (bug) in tesseract command line or API.
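
Since GetUTF8Text only covers the image currently set on the API handle, one workaround is to load each TIFF page yourself and recognize the pages one at a time. This is a rough ctypes sketch of that idea, not something proposed in this thread. It assumes Leptonica's pixReadTiff and pixDestroy and the C API's TessBaseAPISetImage2, TessBaseAPIGetUTF8Text and TessDeleteText are available in your build (check allheaders.h and capi.h). The helper names declare_page_signatures and ocr_page_by_page are hypothetical.

```python
import ctypes


def declare_page_signatures(tess, lept):
    """Declare the C signatures used below; handles are passed as c_void_p."""
    lept.pixReadTiff.argtypes = [ctypes.c_char_p, ctypes.c_int]
    lept.pixReadTiff.restype = ctypes.c_void_p
    lept.pixDestroy.argtypes = [ctypes.POINTER(ctypes.c_void_p)]
    lept.pixDestroy.restype = None
    tess.TessBaseAPISetImage2.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
    tess.TessBaseAPISetImage2.restype = None
    tess.TessBaseAPIGetUTF8Text.argtypes = [ctypes.c_void_p]
    tess.TessBaseAPIGetUTF8Text.restype = ctypes.c_void_p
    tess.TessDeleteText.argtypes = [ctypes.c_void_p]
    tess.TessDeleteText.restype = None


def ocr_page_by_page(tess, lept, api, tiff_path):
    """Return a list with the recognized text of each page, in page order.

    `tess` and `lept` are the loaded libtesseract and liblept libraries;
    `api` is an already initialized TessBaseAPI handle.
    """
    texts = []
    page = 0
    while True:
        pix = lept.pixReadTiff(tiff_path.encode(), page)
        if not pix:  # NULL: no such page, so we are past the last one
            break
        try:
            tess.TessBaseAPISetImage2(api, pix)
            raw = tess.TessBaseAPIGetUTF8Text(api)
            if raw:
                texts.append(ctypes.string_at(raw).decode("utf-8"))
                tess.TessDeleteText(raw)  # free the C string
        finally:
            # pixDestroy expects a PIX**, hence the byref wrapper.
            lept.pixDestroy(ctypes.byref(ctypes.c_void_p(pix)))
        page += 1
    return texts
```

Call declare_page_signatures(tess, lept) once after loading both libraries. Without the c_void_p return types, ctypes may truncate pointers on 64-bit platforms.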

Omnipresent · 2017-09-18T15:38:31Z

Thanks. I've posted this question on the user-forums here: https://groups.google.com/forum/#!topic/tesseract-ocr/AL9LzrHa97k

I will continue to dig into tesseractmain.cpp to see if anything stands out.

amitdo · 2017-09-18T16:29:40Z

There is no direct way to get the text in all pages with ProcessPages.

If you give it a pointer to a TessResultRenderer, the text is written to a file or stdout.
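
A rough ctypes sketch of the renderer route: create a text renderer, pass it to ProcessPages, then read the output file back. It assumes the C API in your libtesseract (3.04 or later) exposes TessTextRendererCreate and TessDeleteResultRenderer, and that the text renderer writes to <output_base>.txt. Both are worth verifying against capi.h for your version. The helper names declare_signatures and ocr_all_pages are hypothetical.

```python
import ctypes
import os
import tempfile


def declare_signatures(lib):
    """Declare the C signatures used below; handles are passed as c_void_p."""
    lib.TessTextRendererCreate.argtypes = [ctypes.c_char_p]
    lib.TessTextRendererCreate.restype = ctypes.c_void_p
    lib.TessDeleteResultRenderer.argtypes = [ctypes.c_void_p]
    lib.TessDeleteResultRenderer.restype = None
    lib.TessBaseAPIProcessPages.argtypes = [
        ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p,
        ctypes.c_int, ctypes.c_void_p,
    ]
    lib.TessBaseAPIProcessPages.restype = ctypes.c_bool


def ocr_all_pages(lib, api, tiff_path):
    """Run ProcessPages with a text renderer and return the text of all pages.

    `lib` is the loaded libtesseract; `api` is an initialized TessBaseAPI handle.
    """
    with tempfile.TemporaryDirectory() as tmp:
        output_base = os.path.join(tmp, "ocr_out")
        renderer = lib.TessTextRendererCreate(output_base.encode())
        if not renderer:
            raise RuntimeError("could not create text renderer")
        try:
            ok = lib.TessBaseAPIProcessPages(
                api, tiff_path.encode(), None, 0, renderer)
        finally:
            # Deleting the renderer flushes and closes the output file.
            lib.TessDeleteResultRenderer(renderer)
        if not ok:
            raise RuntimeError("TessBaseAPIProcessPages failed")
        # Assumption: the text renderer wrote its output to <output_base>.txt.
        with open(output_base + ".txt", encoding="utf-8") as f:
            return f.read()
```

Here lib would be the loaded shared library (for example via ctypes.CDLL), with declare_signatures(lib) called once, and api a handle already set up with TessBaseAPICreate and TessBaseAPIInit3.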

Omnipresent mentioned this issue Sep 17, 2017

Multi-page TIFF buffering is broken #233

Closed

zdenop closed this as completed Sep 18, 2017

Omnipresent mentioned this issue Sep 18, 2017

In a multipage TIFF, results are returned only from the first page openpaperwork/pyocr#76

Closed

otiai10 mentioned this issue Nov 5, 2018

Request for info: support for multi-page tiffs otiai10/gosseract#136

Open
