-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-page TIFF buffering is broken #233
Comments
Please be careful with changes. It is very important not to break streaming support. https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-do-streaming |
Whoever makes the changes will have to make sure that all the tests in the test suite pass, just like any other change to the code base. Of course this wouldn't be necessary if, when streaming support was added, it was done with some recognition of the performance impact rather than just leaving comments like "To keep code simple we will also buffer data coming from a file." https://github.com/tesseract-ocr/tesseract/blob/master/api/baseapi.cpp#L1119 |
I plead guilty on all counts. I wrote the streaming feature, I failed to write a test, I introduced the performance regression, I wrote the comment, and either I didn't notice the performance regression or failed to properly consider it. I will attempt to write a test for streaming, and will work with you on TIFF. TIFF is historically tricky for two reasons. One is the duplicated functionality on the OpenCL path. Hard to synchronize and hard to test. I personally can't seem to run the OpenCL path at all without a segfault. Second, under win32 is it hard or impossible to pass a file descriptor between different DLLs. Which is awkward because the libtiff API prefers to work with file descriptors instead of file pointers. |
@jbreiden I don't see any particular reason that your feature should be the first to have a test. As far as I can tell, there are no tests at all for the entire program. Can you point to a more complete description of the file descriptor issue? Does TIFFClientOpen help at all? Some random, possibly relevant, tidbits:
|
I was intimately familiar with the topic a decade ago. I have http://www.asmail.be/msg0054669449.html It's nice that the filename is already being passed around; You are right, the streaming feature is super narrow. But |
@tfmorris How about something like this? If you are happy with it, I'll hand this change over to Ray for review and eventual inclusion. No API changes. I think functionally it is exactly the same, except that a multipage TIFF from a file does not get buffered. --- api/baseapi.cpp 2016-03-11 14:29:36.000000000 -0800
+++ api/baseapi.cpp 2016-03-28 15:49:06.000000000 -0700
@@ -1034,11 +1034,14 @@
page = tessedit_page_number;
#ifdef USE_OPENCL
if ( od.selectedDeviceIsOpenCL() ) {
- // FIXME(jbreiden) Not implemented.
- pix = od.pixReadMemTiffCl(data, size, page);
+ pix = (data) ?
+ od.pixReadMemTiffCl(data, size, page) :
+ od.pixReadTiffCl(filename, page);
} else {
#endif
- pix = pixReadMemTiff(data, size, page);
+ pix = (data) ?
+ pixReadMemTiff(data, size, page) :
+ pixReadTiff(filename, page);
#ifdef USE_OPENCL
}
#endif
@@ -1086,8 +1089,7 @@
// makes automatic detection of datatype (TIFF? filelist? PNG?)
// impractical. So we support a command line flag to explicitly
// identify the scenario that really matters: filelists on
-// stdin. We'll still do our best if the user likes pipes. That means
-// piling up any data coming into stdin into a memory buffer.
+// stdin. We'll still do our best if the user likes pipes.
bool TessBaseAPI::ProcessPagesInternal(const char* filename,
const char* retry_config,
int timeout_millisec,
@@ -1109,31 +1111,24 @@
}
// At this point we are officially in autodection territory.
- // That means we are going to buffer stdin so that it is
- // seekable. To keep code simple we will also buffer data
- // coming from a file.
+ // That means any data in stdin must be buffered, to make it
+ // seekable.
std::string buf;
+ const l_uint8 *data = NULL;
if (stdInput) {
buf.assign((std::istreambuf_iterator<char>(std::cin)),
(std::istreambuf_iterator<char>()));
- } else {
- std::ifstream ifs(filename, std::ios::binary);
- if (ifs) {
- buf.assign((std::istreambuf_iterator<char>(ifs)),
- (std::istreambuf_iterator<char>()));
- } else {
- tprintf("ERROR: Can not open input file %s\n", filename);
- return false;
- }
+ data = reinterpret_cast<const l_uint8 *>(buf.data());
}
// Here is our autodetection
int format;
- const l_uint8 * data = reinterpret_cast<const l_uint8 *>(buf.c_str());
- findFileFormatBuffer(data, &format);
+ int r = (stdInput) ?
+ findFileFormatBuffer(data, &format) :
+ findFileFormat(filename, &format);
// Maybe we have a filelist
- if (format == IFF_UNKNOWN) {
+ if (r != 0 || format == IFF_UNKNOWN) {
STRING s(buf.c_str());
return ProcessPagesFileList(NULL, &s, retry_config,
timeout_millisec, renderer,
@@ -1149,7 +1144,7 @@
// Fail early if we can, before producing any output
Pix *pix = NULL;
if (!tiff) {
- pix = pixReadMem(data, buf.size());
+ pix = (stdInput) ? pixReadMem(data, buf.size()) : pixRead(filename);
if (pix == NULL) {
return false;
}
@@ -1162,16 +1157,15 @@
}
// Produce output
- bool r = false;
- if (tiff) {
- r = ProcessPagesMultipageTiff(data, buf.size(), filename, retry_config,
- timeout_millisec, renderer,
- tesseract_->tessedit_page_number);
- } else {
- r = ProcessPage(pix, 0, filename, retry_config,
- timeout_millisec, renderer);
- pixDestroy(&pix);
- }
+ r = (tiff) ?
+ ProcessPagesMultipageTiff(data, buf.size(), filename, retry_config,
+ timeout_millisec, renderer,
+ tesseract_->tessedit_page_number) :
+ ProcessPage(pix, 0, filename, retry_config,
+ timeout_millisec, renderer);
+
+ // Clean up memory as needed
+ pixDestroy(&pix);
// End the output
if (!r || (renderer && !renderer->EndDocument())) { |
I didn't mean to complain about this without offering a solution. I took a crack at this here: https://github.com/tfmorris/tesseract/tree/tiff-streaming Some other notes (mostly from memory, so take with a grain of salt):
I'll review your proposal in more detail tomorrow to see how it compares with what I started implementing. |
Hello TMorris, Jeff just told me about this thread. And thank you for pointing out the unsatisfactory condition of the multi-tiff read function. The leptonica buck stops with me, and I made a small change that brings it down to linear (not quadratic). I will have it up on github later today. -- Dan |
The patch from me above only buffers images coming from stdin. We could try to optimize this more and limit that buffer to 12 bytes, but it is not obvious how useful that is. I'm guessing it isn't worth even a little extra complexity. |
Looking at my "fix" again, I believe it does NOT succeed in making the read time for N pages linear in N. The problem is that TIFFSetDirectory() always starts at directory 0 in the search. The way to fix this is to use the lower-level functions that TIFFSetDirectory() uses to walk through the directories, grabbing the image at each directory. I'll attempt to remedy this in the next day or so. |
@DanBloomberg Thanks for looking at this. Your most recent comment matches my (slightly vague) memory of how I thought things worked. @jbreiden I'm not really in a position to make value judgements about what's worthwhile and what's not since I'm not familiar with the user base. I think my general plan of attack to keep things clean was to refactor to use the Leptonica stream functions (e.g. pixReadStreamTiff rather than pixReadTiff and pixReadMemTiff), but that would depend on Leptonica being smart enough to handle the page N -> page N+1 case without seeking (or introducing a new pure streaming API). |
OK, it's all properly linearized. No refactoring, no low level functions, no api change, no static vars required. See github.com/danbloomberg/leptonica. |
To implement this properly in tesseract, where you only want one image in memory at the same time, use the same approach that I just did in pixReadMultipageTiff():
The last function is a naked tiff library call. Jeff says that currently all the tiff reading functions are leptonica calls. |
@DanBloomberg Thanks for the outline (and for the new functionality!) Do you seeing any risk in going around Leptonica to libTIFF or do you consider this to be stable enough to be a non-issue? @jbreiden I'm happy to take another crack at this or leave you to it. Let me know. |
There are two separable things under discussion here. The first is the unnecessary buffering. By the way, I checked and file sniffing does indeed return things like IFF_TIFF_G4. I think it makes sense to use my patch. @theraysmith has reviewed it, taken ownership, and will submit it to github. The other topic is TIFF performance. I don't think I want to tackle that one. It's not technically difficult to write a couple libtiff calls to fix the performance problem. However, this would give Tesseract a new, direct build dependency on libtiff. That seems significant enough to warrant discussion on the development mailing list. I don't know how much of an obstacle that would be for users who build from source; it seems that many already struggle with dependencies. I'll talk with Dan about whether there are any other options that make sense. |
Jeff and I agree that you need a direct dependency on the TIFF data As for stability, tiff lib has been extremely stable for 20 years or so -- On Wed, Mar 30, 2016 at 12:04 PM, jbreiden notifications@github.com wrote:
|
If Ray & Jeff have agreed the correct course of action, I'll defer.
|
Tom, Ray hasn't weighed in yet. I believe the question comes down to (1) a I don't know either of these two things. Do you have a timing for a seek -- Dan On Wed, Mar 30, 2016 at 3:28 PM, Tom Morris notifications@github.com
|
Ray is kind of busy so he may be slow to submit my buffering patch to github. It is okay for someone else to submit if time is of the essence. Regarding the TIFF performance issue, I see it in two places. So far we've been discussing TessBaseAPI::ProcessPagesMultipageTiff but we have the exact some problem in MasterTrainer::LoadPageImages tesseract/classify/mastertrainer.cpp Line 219 in dd8c129
|
Getting back to this now that I (might?) have clearance to make direct libtiff calls. |
Proof of concept for a future Leptonica interface that would get us down to linear seeks. This evolved from the conversation in bug #367 #include <stdio.h>
#include <tiffio.h>
const char *testfile = "test.tiff";
size_t PrimeThePump() {
TIFF *tiff = TIFFOpen(testfile, "r");
TIFFSetDirectory(tiff, 0);
size_t offset = TIFFCurrentDirOffset(tiff);
TIFFClose(tiff);
return offset;
}
size_t ThankYouSirMayIHaveAnother(size_t offset) {
TIFF *tiff = TIFFOpen(testfile, "r");
TIFFSetSubDirectory(tiff, offset);
TIFFReadDirectory(tiff);
offset = TIFFCurrentDirOffset(tiff);
TIFFClose(tiff);
return offset;
}
int main(void) {
size_t offset = PrimeThePump();
while (offset = ThankYouSirMayIHaveAnother(offset)) {
printf("offset=%lu\n", offset);
}
} |
It's been a while since I looked at it (and don't have time to recheck now), but that looks about like what I remember thinking would be good. |
* This now resolves longstanding need for linear performance when reading multi-image TIFF files. For example, tesseract should be able to store a million small images in a file and extract them efficiently. See, e.g., tesseract-ocr/tesseract#233 Thanks to Jeff Breidenbach for figuring out how to do this in a general way without exposing TIFF internals to the client.
This patch reduces multipage TIFF seeks from O(n^3) to O(n), but requires the not-yet-released Leptonica 1.74. The patch disables the OpenCL accelerated TIFF codec. I'm confident that I could make OpenCL path work with some effort, but it is hard for me to test and I don't know how active and important Tesseract + OpenCL is these days. --- tesseract/api/baseapi.cpp 2016-05-24 15:32:21.000000000 -0700
+++ tesseract/api/baseapi.cpp 2016-09-13 09:21:41.000000000 -0700
@@ -1025,26 +1025,14 @@
int tessedit_page_number) {
#ifndef ANDROID_BUILD
Pix *pix = NULL;
-#ifdef USE_OPENCL
- OpenclDevice od;
-#endif
int page = (tessedit_page_number >= 0) ? tessedit_page_number : 0;
+ size_t offset = 0;
for (; ; ++page) {
if (tessedit_page_number >= 0)
page = tessedit_page_number;
-#ifdef USE_OPENCL
- if ( od.selectedDeviceIsOpenCL() ) {
pix = (data) ?
- od.pixReadMemTiffCl(data, size, page) :
- od.pixReadTiffCl(filename, page);
- } else {
-#endif
- pix = (data) ?
- pixReadMemTiff(data, size, page) :
- pixReadTiff(filename, page);
-#ifdef USE_OPENCL
- }
-#endif
+ pixReadMemFromMultipageTiff(data, size, &offset) :
+ pixReadFromMultipageTiff(filename, &offset);
if (pix == NULL) break;
tprintf("Page %d\n", page + 1);
char page_str[kMaxIntSize];
@@ -1055,6 +1043,7 @@
pixDestroy(&pix);
if (!r) return false;
if (tessedit_page_number >= 0) break;
+ if (!offset) break;
}
return true;
#else
--- tesseract/classify/mastertrainer.cpp 2016-05-18 14:18:32.000000000 -0700
+++ tesseract/classify/mastertrainer.cpp 2016-09-13 09:30:11.000000000 -0700
@@ -214,10 +214,14 @@
// Must be called after ReadTrainingSamples, as the current number of images
// is used as an offset for page numbers in the samples.
void MasterTrainer::LoadPageImages(const char* filename) {
+ size_t offset = 0;
int page;
Pix* pix;
- for (page = 0; (pix = pixReadTiff(filename, page)) != NULL; ++page) {
+ for (page = 0; ; page++) {
+ pix = pixReadFromMultipageTiff(filename, &offset);
+ if (!pix) break;
page_images_.push_back(pix);
+ if (!offset) break;
}
tprintf("Loaded %d page images from %s\n", page, filename);
} |
Then lets wait for new leptonica release. IMO it would be fine to fix OpenCL too - I got promises that somebody should have a look at opencl issues... |
patch from 2016-03-29 committed as 54fafc4 |
Future tess. release could use lept 1.74. It will be released soon. |
@jbreiden |
Isn't it possible to write code which supports both old and new (> 1.73) Leptonica (using conditional compilation)? I'd prefer such a solution, at least until Leptonica 1.73 or older is no longer used in current distributions. |
I will work with Dan next week to try to keep as much compatibility as
possible.
…On Fri, Nov 25, 2016 at 11:40 AM, Stefan Weil ***@***.***> wrote:
It will be a significant problem to change Leptonica and Tesseract
simultaneously [...]
Isn't it possible to write code which supports both old and new (> 1.73)
Leptonica (using conditional compilation)? I'd prefer such a solution, at
least until Leptonica 1.73 or older is no longer used in current
distributions.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#233 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AEu2poRWaz7m4Qob3PFumG3ufwxKuGMDks5rBzmpgaJpZM4HeMNo>
.
|
Thanks, Jeff.
I believe this is an unusual situation where a leptonica interface that
tesseract uses has been changed.
And I want to apologize for the trouble it has caused.
There is a trivial change to textord/imagefind.cpp that fixes this: replace
the last arg in pixGenHalftoneMask() by NULL.
This will skip the debug output images in this function, but it should be
acceptable, and later if someone really wants the extra few debug images we
can add the code to save them.
…On Fri, Nov 25, 2016 at 11:47 AM, jbreiden ***@***.***> wrote:
I will work with Dan next week to try to keep as much compatibility as
possible.
On Fri, Nov 25, 2016 at 11:40 AM, Stefan Weil ***@***.***>
wrote:
> It will be a significant problem to change Leptonica and Tesseract
> simultaneously [...]
>
> Isn't it possible to write code which supports both old and new (> 1.73)
> Leptonica (using conditional compilation)? I'd prefer such a solution, at
> least until Leptonica 1.73 or older is no longer used in current
> distributions.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#233#
issuecomment-263015130>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/
AEu2poRWaz7m4Qob3PFumG3ufwxKuGMDks5rBzmpgaJpZM4HeMNo>
> .
>
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#233 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AP6mLMRRBLX64JXlpNpXoezNWvEzsAoWks5rBzs3gaJpZM4HeMNo>
.
|
Let’s talk before Leptonica 1.74 ships. There is a distribution headache if the existing, unmodified Tesseract 3.0.4 can't compile and run with Leptonica 1.74. |
The alternative to updating the pixGenHalftoneMask() function in tesseract
is to make a wrapper in leptonica for the existing tesseract function. This
would simply call the new leptonica function with NULL for the last arg.
…On Fri, Nov 25, 2016 at 12:10 PM, jbreiden ***@***.***> wrote:
Let’s talk before Leptonica 1.74 ships. There is a distribution headache
if the existing, unmodified Tesseract 3.0.4 can't compile and run with
Leptonica 1.74.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#233 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AP6mLA_GAbX1LR6Vpbo0vK2ads5jlQATks5rB0C-gaJpZM4HeMNo>
.
|
Yes, this seems to be ABI breakage. Both tesseract and leptonica do not use semver (X.Y.Z), but use (X.Y), so personally I'm confused. With semver ABI breakage only allowed when increasing X number. So, e.g. leptonica should be versioned as 2.00 or whatever (2.0.0?). |
Here is my unofficial take on the ABI shared object version numbers. Jeff
knows about these details and can correct if I'm wrong.
With the Debian releases, we will need to increase the *shared object
version number* (which is different from the leptonica release number) with
any change in the ABI. The Debian leptonica *soversion* for 1.73 is 5.0.0,
and for 1.74 I believe that we'll increase it to 6.0.0.
The meaning of these three digits is, for the shared object name
*whatever.so.X.Y.Z*, we increment
X if the ABI release is backwards incompatible
Y if the ABI release is backwards compatible (with interface changes)
Z if there are only internal changes (no change to the ABI)
So if I write the wrapper mentioned above, and that were the only change,
then the ABI would be backwards compatible and we'd only need to increment
to 5.1.0 for the next Debian release. (However, there have been other
changes, including the removal of deprecated functions.)
…On Fri, Nov 25, 2016 at 1:01 PM, Egor Pugin ***@***.***> wrote:
Yes, this seems ABI breakage. Both tesseract and leptonica do not use
semver (X.Y.Z), but use (X.Y), so personally I'm confused. With semver ABI
breakage only allowed when increasing X number. So, e.g. leptonica should
be versioned as 2.00 or whatever.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#233 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AP6mLHznAbmXo5UwOhuUdFnN5qn1DQxCks5rB0y5gaJpZM4HeMNo>
.
|
I've added the wrapper for pixGenHalftoneMask(), so the (not yet released
1.74) git master now should be compatible with tesseract 3.0.4.
With this, tesseract can be changed at your convenience to use the new
interface pixGenerateHalftoneMask() with NULL for the last arg, and both
will compile. We'll keep the old version in leptonica until there is no
further need to use it with tesseract.
On Fri, Nov 25, 2016 at 2:31 PM, Dan Bloomberg <dan.bloomberg@gmail.com>
wrote:
… Here is my unofficial take on the ABI shared object version numbers. Jeff
knows about these details and can correct if I'm wrong.
With the Debian releases, we will need to increase the *shared object
version number* (which is different from the leptonica release number)
with any change in the ABI. The Debian leptonica *soversion* for 1.73 is
5.0.0, and for 1.74 I believe that we'll increase it to 6.0.0.
The meaning of these three digits is, for the shared object name
*whatever.so.X.Y.Z*, we increment
X if the ABI release is backwards incompatible
Y if the ABI release is backwards compatible (with interface changes)
Z if there are only internal changes (no change to the ABI)
So if I write the wrapper mentioned above, and that were the only change,
then the ABI would be backwards compatible and we'd only need to increment
to 5.1.0 for the next Debian release. (However, there have been other
changes, including the removal of deprecated functions.)
On Fri, Nov 25, 2016 at 1:01 PM, Egor Pugin ***@***.***>
wrote:
> Yes, this seems ABI breakage. Both tesseract and leptonica do not use
> semver (X.Y.Z), but use (X.Y), so personally I'm confused. With semver ABI
> breakage only allowed when increasing X number. So, e.g. leptonica should
> be versioned as 2.00 or whatever.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <#233 (comment)>,
> or mute the thread
> <https://github.com/notifications/unsubscribe-auth/AP6mLHznAbmXo5UwOhuUdFnN5qn1DQxCks5rB0y5gaJpZM4HeMNo>
> .
>
|
It's ok now. master tess + lept work fine. |
@DanBloomberg: When leptonica 1.74 will be released? |
Just released leptonica 1.74.0 on github :-) |
Also added to cppan. We could stick now to 1.74.0 to prevent possible abi breakages. |
I found this issue while searching for a bug I've been encountering: In a multi-page tiff file, the text is only being extracted from the last page when using Tesseract API [ Leptonica has a method called |
The current multi-page TIFF handling is seriously sub-optimal. First, it unnecessarily reads the entire file into memory which can tie up many MB of memory unnecessarily when pages are being processed in a streaming fashion. Second, I'm pretty sure that accessing by page number causes the entire buffer to be parsed from the beginning every time incurring both processing cost and memory thrashing.
The text was updated successfully, but these errors were encountered: