TessBaseAPIProcessPages only processes information for the last page #1138

Closed
Omnipresent opened this issue Sep 17, 2017 · 10 comments

Comments

@Omnipresent

Omnipresent commented Sep 17, 2017


Environment

  • Tesseract Version: tesseract 3.04.00
  • Commit Number:
  • Platform: Linux 0ba52aafd58a 4.9.4-moby #1 SMP Wed Jan 18 17:04:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Current Behavior:

When processing a multi-page TIFF with the Tesseract API from Python, the text returned is only for the LAST page rather than for ALL pages.

Code used:

self.tesseract.TessBaseAPIProcessPages.argtypes = [POINTER(TessBaseAPI), c_char_p, c_char_p, c_int, POINTER(TessResultRenderer)]
self.tesseract.TessBaseAPIProcessPages.restype = c_bool
success = self.tesseract.TessBaseAPIProcessPages(self.api, create_string_buffer(path_to_multipage_tiff), None , 0, None)
ocr_r = self.tesseract.TessBaseAPIGetUTF8Text(self.api)
result = string_at(ocr_r)

Expected Behavior:

Instead of text returned only for the last page, the text should be returned for all pages.

Suggested Fix:

Interestingly, this works with the tesseract command-line tool, so perhaps there is already a fix for the command line.

@amitdo
Collaborator

amitdo commented Sep 18, 2017

Please report this issue in the repo of the python binding.

@zdenop
Contributor

zdenop commented Sep 18, 2017

...and the latest tesseract stable version is 3.05.01

@zdenop zdenop closed this as completed Sep 18, 2017
@Omnipresent
Author

Omnipresent commented Sep 18, 2017

@zdenop This happens in 3.05.01 as well. Is calling TessBaseAPIProcessPages the correct way to get ALL text from multi-page PDFs, or is there another recommended way?

@amitdo I did not see a Python binding repo under tesseract. Can you please link me to the one you are referring to? Please note that I'm using ctypes to call the actual Tesseract API, not a Python wrapper for Tesseract.

@amitdo
Collaborator

amitdo commented Sep 18, 2017

I did not see a Python binding repo under tesseract. Can you please link me to the one you are referring to? Please note that I'm using ctypes to call the actual Tesseract API

I thought you were using a third-party Python binding.
https://github.com/tesseract-ocr/tesseract/wiki/AddOns#tesseract-30x

@amitdo
Collaborator

amitdo commented Sep 18, 2017

Interestingly, this works with the tesseract command-line tool, so perhaps there is already a fix for the command line.

The command line uses the C++ API.

@Omnipresent
Author

Omnipresent commented Sep 18, 2017

@amitdo Sorry for the confusion, but I am not. I am referring to the actual Tesseract C API, which can be called from Python as shown in this example:

text_out = tesseract.TessBaseAPIProcessPages(api, filename, None , 0);
Note that the linked example is a bit out of date. A working example is at https://stackoverflow.com/a/36876584/44286

However, the question remains: how do I call the C API to get the text from all pages of a multi-page TIFF instead of only the last page?

@zdenop
Contributor

zdenop commented Sep 18, 2017

We do not provide support for third-party software (e.g. Python), so you need to be able to replicate the problem with C++ or C.
I created some examples of using the API from Python, but it is sometimes a little tricky and has no official support...
Please use the tesseract user forum instead (there are more people with more experience there).

@amitdo
Collaborator

amitdo commented Sep 18, 2017

text = tesseract.TessBaseAPIGetUTF8Text(api)

This method returns text for one page only. The command line tool does not directly call this method.

You'll have to look in api/tesseractmain.cpp and mimic it to get things right.

Anyway, it's not an issue (bug) in tesseract command line or API.
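
As a rough illustration of the per-page approach described above, here is a minimal Python 3 ctypes sketch that loads each TIFF page itself and calls TessBaseAPIGetUTF8Text once per page. The helper names collect_pages and ocr_pages_individually are hypothetical, the handles tess, lept, and api are assumed to be created and initialized by the caller, and the loop assumes that Leptonica's pixReadTiff returns NULL once the page index is past the end of the file. It is not a copy of what tesseractmain.cpp does.

```python
from ctypes import POINTER, byref, c_char_p, c_int, c_void_p, string_at


def collect_pages(read_page, recognize, release, max_pages=10000):
    """Call recognize() on each page until read_page() returns nothing."""
    pages = []
    for index in range(max_pages):
        page = read_page(index)
        if not page:  # no such page: assume the end of the document
            break
        try:
            pages.append(recognize(page))
        finally:
            release(page)
    return pages


def ocr_pages_individually(tess, lept, api, tiff_path):
    """OCR every page of a multi-page TIFF; returns one string per page.

    tess and lept are ctypes handles for libtesseract and Leptonica,
    and api is an already initialized TessBaseAPI handle (assumptions).
    """
    lept.pixReadTiff.argtypes = [c_char_p, c_int]
    lept.pixReadTiff.restype = c_void_p
    lept.pixDestroy.argtypes = [POINTER(c_void_p)]
    lept.pixDestroy.restype = None
    tess.TessBaseAPISetImage2.argtypes = [c_void_p, c_void_p]
    tess.TessBaseAPISetImage2.restype = None
    tess.TessBaseAPIGetUTF8Text.argtypes = [c_void_p]
    tess.TessBaseAPIGetUTF8Text.restype = c_void_p  # raw pointer, freed below
    tess.TessDeleteText.argtypes = [c_void_p]
    tess.TessDeleteText.restype = None

    path = tiff_path.encode()

    def read_page(index):
        # Assumed to return None (NULL) past the last page.
        return lept.pixReadTiff(path, index)

    def recognize(pix):
        tess.TessBaseAPISetImage2(api, pix)
        raw = tess.TessBaseAPIGetUTF8Text(api)  # text of the current page only
        if not raw:
            return ""
        try:
            return string_at(raw).decode("utf-8")
        finally:
            tess.TessDeleteText(raw)

    def release(pix):
        lept.pixDestroy(byref(c_void_p(pix)))

    return collect_pages(read_page, recognize, release)
```

Joining the returned list, for example with "\n".join(pages), would give the text of the whole document in one string.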

@Omnipresent
Author

Thanks. I've posted this question on the user forum here: https://groups.google.com/forum/#!topic/tesseract-ocr/AL9LzrHa97k

I will continue to dig into tesseractmain.cpp to see if anything stands out.

@amitdo
Collaborator

amitdo commented Sep 18, 2017

There is no direct way to get the text from all pages with ProcessPages.

If you give it a pointer to a TessResultRenderer, the text is written to a file or stdout.
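
A sketch of the renderer route in Python 3 ctypes might look like the following. It assumes the C API functions TessTextRendererCreate and TessDeleteResultRenderer are available in the installed library, that the text renderer writes to the output base name plus ".txt", and that tess and api are handles the caller has already set up. The names ocr_all_pages_via_renderer and split_pages are hypothetical. Whether and how pages are separated in the output depends on the Tesseract version and configuration, so the form-feed default in split_pages is only an assumption.

```python
import os
import tempfile
from ctypes import c_bool, c_char_p, c_int, c_void_p


def split_pages(text, separator="\f"):
    """Split renderer output into pages, dropping an empty trailing page."""
    pages = text.split(separator)
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


def ocr_all_pages_via_renderer(tess, api, image_path):
    """Run ProcessPages with a text renderer and return the text of all pages.

    tess is a ctypes handle for libtesseract and api is an already
    initialized TessBaseAPI handle (both are assumptions of this sketch).
    """
    tess.TessTextRendererCreate.argtypes = [c_char_p]
    tess.TessTextRendererCreate.restype = c_void_p
    tess.TessDeleteResultRenderer.argtypes = [c_void_p]
    tess.TessDeleteResultRenderer.restype = None
    tess.TessBaseAPIProcessPages.argtypes = [
        c_void_p, c_char_p, c_char_p, c_int, c_void_p]
    tess.TessBaseAPIProcessPages.restype = c_bool

    with tempfile.TemporaryDirectory() as tmp:
        outputbase = os.path.join(tmp, "ocr")
        renderer = tess.TessTextRendererCreate(outputbase.encode())
        try:
            ok = tess.TessBaseAPIProcessPages(
                api, image_path.encode(), None, 0, renderer)
        finally:
            # Deleting the renderer is what closes the output file.
            tess.TessDeleteResultRenderer(renderer)
        if not ok:
            raise RuntimeError("TessBaseAPIProcessPages reported failure")
        # Assumed output name: <outputbase>.txt
        with open(outputbase + ".txt", encoding="utf-8") as handle:
            return handle.read()
```

The temporary directory keeps the intermediate file out of the way. The returned string holds every page, and split_pages can break it up again when a page separator is present.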
