Multi-page TIFF buffering is broken #233

tfmorris · 2016-02-19T17:42:45Z

The current multi-page TIFF handling is seriously sub-optimal. First, it unnecessarily reads the entire file into memory which can tie up many MB of memory unnecessarily when pages are being processed in a streaming fashion. Second, I'm pretty sure that accessing by page number causes the entire buffer to be parsed from the beginning every time incurring both processing cost and memory thrashing.

jbreiden · 2016-02-19T21:17:21Z

Please be careful with changes. It is very important not to break streaming support.

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-do-streaming

tfmorris · 2016-02-19T23:07:48Z

Whoever makes the changes will have to make sure that all the tests in the test suite pass, just like any other change to the code base.

Of course this wouldn't be necessary if, when streaming support was added, it was done with some recognition of the performance impact rather than just leaving comments like "To keep code simple we will also buffer data coming from a file." https://github.com/tesseract-ocr/tesseract/blob/master/api/baseapi.cpp#L1119

amitdo · 2016-02-19T23:54:58Z

https://groups.google.com/forum/?hl=en#!topic/tesseract-dev/LMo_igM4z90

jbreiden · 2016-02-20T00:15:53Z

I plead guilty on all counts. I wrote the streaming feature, I failed to write a test, I introduced the performance regression, I wrote the comment, and either I didn't notice the performance regression or failed to properly consider it.

I will attempt to write a test for streaming, and will work with you on TIFF. TIFF is historically tricky for two reasons. One is the duplicated functionality on the OpenCL path, which is hard to synchronize and hard to test. I personally can't seem to run the OpenCL path at all without a segfault. Second, under win32 it is hard or impossible to pass a file descriptor between different DLLs, which is awkward because the libtiff API prefers to work with file descriptors instead of file pointers.

tfmorris · 2016-02-20T04:03:28Z

@jbreiden I don't see any particular reason that your feature should be the first to have a test. As far as I can tell, there are no tests at all for the entire program. Can you point to a more complete description of the file descriptor issue? Does TIFFClientOpen help at all?

Some random, possibly relevant, tidbits:

- "TIFF is not a streamable format" - http://www.awaresystems.be/imaging/tiff/faq.html#q7
- The "streaming" support doesn't stream, at least for anything other than a list of file names. All other formats are read entirely into memory before any processing starts. That's a pretty narrow definition of streaming.
- File format sniffing requires 12 bytes, no more.
- Multi-page TIFF is on an entirely separate code path, which may make it easier to handle differently.
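To illustrate how little data sniffing needs, here is a minimal sketch that recognizes a classic TIFF header from the first four bytes of a buffer. `LooksLikeClassicTiff` is a hypothetical helper, not a Leptonica function; Leptonica's own detection covers many more formats, and BigTIFF (version number 43) is deliberately ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Leptonica or Tesseract): reports whether
// a buffer starts with a classic TIFF header. Only 4 bytes are inspected.
bool LooksLikeClassicTiff(const std::uint8_t* buf, std::size_t len) {
  assert(buf != nullptr || len == 0);
  if (len < 4) return false;
  // Little-endian byte order mark "II" followed by the version number 42.
  const bool little =
      buf[0] == 'I' && buf[1] == 'I' && buf[2] == 42 && buf[3] == 0;
  // Big-endian byte order mark "MM" followed by the version number 42.
  const bool big =
      buf[0] == 'M' && buf[1] == 'M' && buf[2] == 0 && buf[3] == 42;
  return little || big;
}
```

A caller reading from a pipe would only have to hold back a handful of bytes to make this decision, rather than the whole input.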

jbreiden · 2016-02-20T05:24:41Z

I was intimately familiar with the topic a decade ago. I have
tried my best to repress the memories.

http://www.asmail.be/msg0054669449.html

It's nice that the filename is already being passed around;
if we're lucky, maybe we can get what you need without any
API changes. Just out of curiosity, what are you using
multipage TIFF for? I usually think fax images, which are
relatively tiny.

You are right, the streaming feature is super narrow. But
it is important for book digitization.

jbreiden · 2016-03-28T22:51:41Z

@tfmorris How about something like this? If you are happy with it, I'll hand this change over to Ray for review and eventual inclusion. No API changes. I think functionally it is exactly the same, except that a multipage TIFF from a file does not get buffered.

--- api/baseapi.cpp 2016-03-11 14:29:36.000000000 -0800
+++ api/baseapi.cpp 2016-03-28 15:49:06.000000000 -0700
@@ -1034,11 +1034,14 @@
       page = tessedit_page_number;
 #ifdef USE_OPENCL
     if ( od.selectedDeviceIsOpenCL() ) {
-      // FIXME(jbreiden) Not implemented.
-      pix = od.pixReadMemTiffCl(data, size, page);
+      pix = (data) ?
+          od.pixReadMemTiffCl(data, size, page) :
+          od.pixReadTiffCl(filename, page);
     } else {
 #endif
-      pix = pixReadMemTiff(data, size, page);
+      pix = (data) ?
+          pixReadMemTiff(data, size, page) :
+          pixReadTiff(filename, page);
 #ifdef USE_OPENCL
     }
 #endif
@@ -1086,8 +1089,7 @@
 // makes automatic detection of datatype (TIFF? filelist? PNG?)
 // impractical.  So we support a command line flag to explicitly
 // identify the scenario that really matters: filelists on
-// stdin. We'll still do our best if the user likes pipes.  That means
-// piling up any data coming into stdin into a memory buffer.
+// stdin. We'll still do our best if the user likes pipes.
 bool TessBaseAPI::ProcessPagesInternal(const char* filename,
                                        const char* retry_config,
                                        int timeout_millisec,
@@ -1109,31 +1111,24 @@
   }

   // At this point we are officially in autodection territory.
-  // That means we are going to buffer stdin so that it is
-  // seekable. To keep code simple we will also buffer data
-  // coming from a file.
+  // That means any data in stdin must be buffered, to make it
+  // seekable.
   std::string buf;
+  const l_uint8 *data = NULL;
   if (stdInput) {
     buf.assign((std::istreambuf_iterator<char>(std::cin)),
                (std::istreambuf_iterator<char>()));
-  } else {
-    std::ifstream ifs(filename, std::ios::binary);
-    if (ifs) {
-      buf.assign((std::istreambuf_iterator<char>(ifs)),
-                 (std::istreambuf_iterator<char>()));
-    } else {
-      tprintf("ERROR: Can not open input file %s\n", filename);
-      return false;
-    }
+    data = reinterpret_cast<const l_uint8 *>(buf.data());
   }

   // Here is our autodetection
   int format;
-  const l_uint8 * data = reinterpret_cast<const l_uint8 *>(buf.c_str());
-  findFileFormatBuffer(data, &format);
+  int r = (stdInput) ?
+      findFileFormatBuffer(data, &format) :
+      findFileFormat(filename, &format);

   // Maybe we have a filelist
-  if (format == IFF_UNKNOWN) {
+  if (r != 0 || format == IFF_UNKNOWN) {
     STRING s(buf.c_str());
     return ProcessPagesFileList(NULL, &s, retry_config,
                                 timeout_millisec, renderer,
@@ -1149,7 +1144,7 @@
   // Fail early if we can, before producing any output
   Pix *pix = NULL;
   if (!tiff) {
-    pix = pixReadMem(data, buf.size());
+    pix = (stdInput) ? pixReadMem(data, buf.size()) : pixRead(filename);
     if (pix == NULL) {
       return false;
     }
@@ -1162,16 +1157,15 @@
   }

   // Produce output
-  bool r = false;
-  if (tiff) {
-    r = ProcessPagesMultipageTiff(data, buf.size(), filename, retry_config,
-                                  timeout_millisec, renderer,
-                                  tesseract_->tessedit_page_number);
-  } else {
-    r = ProcessPage(pix, 0, filename, retry_config,
-                    timeout_millisec, renderer);
-    pixDestroy(&pix);
-  }
+  r = (tiff) ?
+      ProcessPagesMultipageTiff(data, buf.size(), filename, retry_config,
+                                timeout_millisec, renderer,
+                                tesseract_->tessedit_page_number) :
+      ProcessPage(pix, 0, filename, retry_config,
+                  timeout_millisec, renderer);
+
+  // Clean up memory as needed
+  pixDestroy(&pix);

   // End the output
   if (!r || (renderer && !renderer->EndDocument())) {

tfmorris · 2016-03-29T03:48:48Z

I didn't mean to complain about this without offering a solution. I took a crack at this here: https://github.com/tfmorris/tesseract/tree/tiff-streaming
but it turned into a bit of a yak shaving exercise, so I dropped it. The main roadblock was that the "right" solution is to implement more reasonable support in Leptonica which exposes more of libTIFF's underlying capabilities. libTIFF knows how to do efficient access, it just gets lost on the way up through the layers.

Some other notes (mostly from memory, so take with a grain of salt):

- File sniffing only needs/uses 12 bytes, so not very much needs to be buffered.
- File sniffing only returns the top level TIFF container format, so all the other TIFF_foo checks can be removed.
- libTIFF is positioned on the next directory after an image is read and knows how to read it directly, but Leptonica's readImage(N+1) forces a rewind, then read 0, read 1, read 2, ..., read N, read N+1.
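The cost difference between the two access patterns can be shown with a toy model. This is only a sketch: `ToyDirectoryChain` is a made-up stand-in for the chain of TIFF directories and simply counts how many directories get visited; it does not call libTIFF or Leptonica, and it does not count the final failed read that ends the loop.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for a multi-page TIFF: it only counts directory visits.
class ToyDirectoryChain {
 public:
  explicit ToyDirectoryChain(std::size_t pages) : pages_(pages) {}

  // Page-number access: restart at directory 0 and walk forward to `page`.
  bool ReadByPageNumber(std::size_t page) {
    if (page >= pages_) return false;
    for (std::size_t dir = 0; dir <= page; ++dir) ++visits_;
    return true;
  }

  // Sequential access: continue from wherever the last read stopped.
  bool ReadNext() {
    assert(cursor_ <= pages_);
    if (cursor_ == pages_) return false;
    ++cursor_;
    ++visits_;
    return true;
  }

  std::size_t visits() const { return visits_; }

 private:
  std::size_t pages_;
  std::size_t cursor_ = 0;
  std::size_t visits_ = 0;
};

// Reads every page by number; visits grow as n * (n + 1) / 2.
std::size_t VisitsByPageNumber(std::size_t pages) {
  ToyDirectoryChain chain(pages);
  for (std::size_t page = 0; chain.ReadByPageNumber(page); ++page) {
  }
  return chain.visits();
}

// Reads every page in order; visits grow as n.
std::size_t VisitsSequential(std::size_t pages) {
  ToyDirectoryChain chain(pages);
  while (chain.ReadNext()) {
  }
  return chain.visits();
}
```

For 100 pages the model visits 5050 directories by page number and 100 sequentially, which is the quadratic versus linear gap under discussion.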

I'll review your proposal in more detail tomorrow to see how it compares with what I started implementing.

DanBloomberg · 2016-03-29T19:41:04Z

Hello TMorris,

Jeff just told me about this thread. And thank you for pointing out the unsatisfactory condition of the multi-tiff read function.

The leptonica buck stops with me, and I made a small change that brings it down to linear (not quadratic). I will have it up on github later today.

-- Dan

jbreiden · 2016-03-29T19:59:25Z

The patch from me above only buffers images coming from stdin. We could try to optimize this more and limit that buffer to 12 bytes, but it is not obvious how useful that is. I'm guessing it isn't worth even a little extra complexity.

DanBloomberg · 2016-03-29T22:29:55Z

Looking at my "fix" again, I believe it does NOT succeed in making the read time for N pages linear in N. The problem is that TIFFSetDirectory() always starts at directory 0 in the search. The way to fix this is to use the lower-level functions that TIFFSetDirectory() uses to walk through the directories, grabbing the image at each directory. I'll attempt to remedy this in the next day or so.

tfmorris · 2016-03-29T23:13:09Z

@DanBloomberg Thanks for looking at this. Your most recent comment matches my (slightly vague) memory of how I thought things worked.

@jbreiden I'm not really in a position to make value judgements about what's worthwhile and what's not since I'm not familiar with the user base. I think my general plan of attack to keep things clean was to refactor to use the Leptonica stream functions (e.g. pixReadStreamTiff rather than pixReadTiff and pixReadMemTiff), but that would depend on Leptonica being smart enough to handle the page N -> page N+1 case without seeking (or introducing a new pure streaming API).

DanBloomberg · 2016-03-30T00:06:27Z

OK, it's all properly linearized. No refactoring, no low level functions, no api change, no static vars required.

See github.com/danbloomberg/leptonica.

DanBloomberg · 2016-03-30T00:23:58Z

@tfmorris

To implement this properly in tesseract, where you only want one image in memory at the same time, use the same approach that I just did in pixReadMultipageTiff():

get a FILE stream
use the FILE stream to get a TIFF stream
loop:
- read the pix from the TIFF stream
- do the OCR
- call TIFFReadDirectory() to advance to the next image

The last function is a naked tiff library call. Jeff says that currently all the tiff reading functions are leptonica calls.
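The shape of that loop can be sketched without any TIFF code. Everything here is hypothetical: `ToyPageStream` stands in for the open TIFF stream, a `std::string` stands in for a decoded pix, and `Advance()` plays the role that TIFFReadDirectory() has in the outline above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an open multi-page TIFF stream.
class ToyPageStream {
 public:
  explicit ToyPageStream(std::vector<std::string> pages)
      : pages_(std::move(pages)) {}

  bool empty() const { return pages_.empty(); }

  // Decodes the image at the current directory.
  std::string ReadCurrent() const {
    assert(current_ < pages_.size());
    return pages_[current_];
  }

  // Moves to the next directory; returns false after the last one.
  bool Advance() {
    if (current_ + 1 >= pages_.size()) return false;
    ++current_;
    return true;
  }

 private:
  std::vector<std::string> pages_;
  std::size_t current_ = 0;
};

// Reads, processes, and releases one page at a time. Returns the page count.
int ProcessAllPages(ToyPageStream& stream,
                    const std::function<void(const std::string&)>& ocr) {
  if (stream.empty()) return 0;
  int processed = 0;
  do {
    const std::string image = stream.ReadCurrent();  // only this page is held
    ocr(image);
    ++processed;
  } while (stream.Advance());
  return processed;
}
```

The stream is opened once and never rewound, so each page costs one step no matter how far into the file it sits.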

tfmorris · 2016-03-30T15:10:51Z

@DanBloomberg Thanks for the outline (and for the new functionality!) Do you see any risk in going around Leptonica to libTIFF or do you consider this to be stable enough to be a non-issue?

@jbreiden I'm happy to take another crack at this or leave you to it. Let me know.

jbreiden · 2016-03-30T19:03:48Z

There are two separable things under discussion here. The first is the unnecessary buffering. By the way, I checked and file sniffing does indeed return things like IFF_TIFF_G4. I think it makes sense to use my patch. @theraysmith has reviewed it, taken ownership, and will submit it to github.

The other topic is TIFF performance. I don't think I want to tackle that one. It's not technically difficult to write a couple libtiff calls to fix the performance problem. However, this would give Tesseract a new, direct build dependency on libtiff. That seems significant enough to warrant discussion on the development mailing list. I don't know how much of an obstacle that would be for users who build from source; it seems that many already struggle with dependencies.

I'll talk with Dan about whether there are any other options that make sense.

DanBloomberg · 2016-03-30T20:07:01Z

Jeff and I agree that you need a direct dependency on the TIFF data
structure in the tiff library to use the linear method.

As for stability, tiff lib has been extremely stable for 20 years or so --
I wouldn't worry about that.

tfmorris · 2016-03-30T22:27:51Z

If Ray & Jeff have agreed the correct course of action, I'll defer.

DanBloomberg · 2016-03-30T23:20:28Z

Tom, Ray hasn't weighed in yet. I believe the question comes down to (1) a
comparison between the amount of time to seek to an image in a tiff file
with many images, vs. the time to OCR that image and (2) the "cost" of
having tesseract depend explicitly on the TIFF library (i.e., using TIFF
data structures and library calls directly).

I don't know either of these two things. Do you have a timing for a seek
of hundreds of images in a large multipage tiff file?

-- Dan

jbreiden · 2016-04-01T20:01:04Z

Ray is kind of busy so he may be slow to submit my buffering patch to github. It is okay for someone else to submit if time is of the essence.

Regarding the TIFF performance issue, I see it in two places. So far we've been discussing TessBaseAPI::ProcessPagesMultipageTiff, but we have the exact same problem in MasterTrainer::LoadPageImages:

tesseract/classify/mastertrainer.cpp

Line 219 in dd8c129

for (page = 0; (pix = pixReadTiff(filename, page)) != NULL; ++page) {

jbreiden · 2016-07-18T17:40:14Z

Getting back to this now that I (might?) have clearance to make direct libtiff calls.
First, I see that Ray did not merge my patch above into the github repo. We need to
get that done before I fix the speed problem with multipage TIFF

jbreiden · 2016-07-18T20:09:08Z

Proof of concept for a future Leptonica interface that would get us down to linear seeks. This evolved from the conversation in bug #367

#include <stdio.h>
#include <tiffio.h>

const char *testfile = "test.tiff";

size_t PrimeThePump() {
  TIFF *tiff = TIFFOpen(testfile, "r");
  TIFFSetDirectory(tiff, 0);
  size_t offset = TIFFCurrentDirOffset(tiff);
  TIFFClose(tiff);
  return offset;
}

size_t ThankYouSirMayIHaveAnother(size_t offset) {
  TIFF *tiff = TIFFOpen(testfile, "r");
  TIFFSetSubDirectory(tiff, offset);
  TIFFReadDirectory(tiff);
  offset = TIFFCurrentDirOffset(tiff);
  TIFFClose(tiff);
  return offset;
}

int main(void) {
  size_t offset = PrimeThePump();
  while ((offset = ThankYouSirMayIHaveAnother(offset)) != 0) {
    printf("offset=%zu\n", offset);
  }
}

tfmorris · 2016-07-19T04:36:55Z

It's been a while since I looked at it (and don't have time to recheck now), but that looks about like what I remember thinking would be good.

* This now resolves longstanding need for linear performance when reading multi-image TIFF files. For example, tesseract should be able to store a million small images in a file and extract them efficiently. See, e.g., tesseract-ocr/tesseract#233 Thanks to Jeff Breidenbach for figuring out how to do this in a general way without exposing TIFF internals to the client.

jbreiden · 2016-09-13T16:43:30Z

This patch reduces multipage TIFF seeks from O(n^2) to O(n), but requires the not-yet-released Leptonica 1.74. The patch disables the OpenCL accelerated TIFF codec. I'm confident that I could make the OpenCL path work with some effort, but it is hard for me to test and I don't know how active and important Tesseract + OpenCL is these days.

--- tesseract/api/baseapi.cpp   2016-05-24 15:32:21.000000000 -0700
+++ tesseract/api/baseapi.cpp   2016-09-13 09:21:41.000000000 -0700
@@ -1025,26 +1025,14 @@
                                             int tessedit_page_number) {
 #ifndef ANDROID_BUILD
   Pix *pix = NULL;
-#ifdef USE_OPENCL
-  OpenclDevice od;
-#endif
   int page = (tessedit_page_number >= 0) ? tessedit_page_number : 0;
+  size_t offset = 0;
   for (; ; ++page) {
     if (tessedit_page_number >= 0)
       page = tessedit_page_number;
-#ifdef USE_OPENCL
-    if ( od.selectedDeviceIsOpenCL() ) {
       pix = (data) ?
-          od.pixReadMemTiffCl(data, size, page) :
-          od.pixReadTiffCl(filename, page);
-    } else {
-#endif
-      pix = (data) ?
-          pixReadMemTiff(data, size, page) :
-          pixReadTiff(filename, page);
-#ifdef USE_OPENCL
-    }
-#endif
+          pixReadMemFromMultipageTiff(data, size, &offset) :
+          pixReadFromMultipageTiff(filename, &offset);
     if (pix == NULL) break;
     tprintf("Page %d\n", page + 1);
     char page_str[kMaxIntSize];
@@ -1055,6 +1043,7 @@
     pixDestroy(&pix);
     if (!r) return false;
     if (tessedit_page_number >= 0) break;
+    if (!offset) break;
   }
   return true;
 #else
--- tesseract/classify/mastertrainer.cpp    2016-05-18 14:18:32.000000000 -0700
+++ tesseract/classify/mastertrainer.cpp    2016-09-13 09:30:11.000000000 -0700
@@ -214,10 +214,14 @@
 // Must be called after ReadTrainingSamples, as the current number of images
 // is used as an offset for page numbers in the samples.
 void MasterTrainer::LoadPageImages(const char* filename) {
+  size_t offset = 0;
   int page;
   Pix* pix;
-  for (page = 0; (pix = pixReadTiff(filename, page)) != NULL; ++page) {
+  for (page = 0; ; page++) {
+    pix = pixReadFromMultipageTiff(filename, &offset);
+    if (!pix) break;
     page_images_.push_back(pix);
+    if (!offset) break;
   }
   tprintf("Loaded %d page images from %s\n", page, filename);
 }
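The loop shape in the patch can be tried out with a toy reader. This is a sketch under assumptions: `ReadPageAt` is a hypothetical stand-in for the offset-taking readers in the patch, and the cursor is modeled as a page index rather than a byte offset into the file. As I read the patch, the cursor comes back as 0 after the last page, which is why the extra `if (!offset) break;` is needed: without it the next call would start over at the first page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an offset-cursor reader. `*offset` selects the
// page to read (0 means "first page") and is updated to the position of the
// next page, or to 0 once the last page has been returned.
const std::string* ReadPageAt(const std::vector<std::string>& pages,
                              std::size_t* offset) {
  assert(offset != nullptr);
  if (*offset >= pages.size()) return nullptr;
  const std::string* page = &pages[*offset];
  *offset = (*offset + 1 < pages.size()) ? *offset + 1 : 0;
  return page;
}

// Mirrors the loop in the patch: stop on a failed read or a zero cursor.
std::vector<std::string> LoadAllPages(const std::vector<std::string>& pages) {
  std::vector<std::string> loaded;
  std::size_t offset = 0;
  for (;;) {
    const std::string* page = ReadPageAt(pages, &offset);
    if (page == nullptr) break;  // empty or unreadable input
    loaded.push_back(*page);
    if (offset == 0) break;  // that was the last page
  }
  return loaded;
}
```

Because the caller owns the cursor, the reader needs no static state and no open handle between calls.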

zdenop · 2016-09-13T20:05:44Z

Then let's wait for the new leptonica release. IMO it would be fine to fix OpenCL too - I got promises that somebody should have a look at opencl issues...

zdenop · 2016-10-06T14:24:57Z

patch from 2016-03-29 committed as 54fafc4

egorpugin · 2016-11-25T18:39:04Z

Future tess. release could use lept 1.74. It will be released soon.
We just need administrative decision: e.g. "Let's fixate on lept1.74 for the time being".

amitdo · 2016-11-25T18:55:17Z

@jbreiden
Is there an approximate date for the final 4.0 release? Maybe just before the freeze of the next Debian stable or Ubuntu 17.04?

stweil · 2016-11-25T19:40:20Z

It will be a significant problem to change Leptonica and Tesseract simultaneously [...]

Isn't it possible to write code which supports both old and new (> 1.73) Leptonica (using conditional compilation)? I'd prefer such a solution, at least until Leptonica 1.73 or older is no longer used in current distributions.
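A minimal sketch of what such a version switch could look like. It assumes Leptonica's headers define `LIBLEPT_MAJOR_VERSION` and `LIBLEPT_MINOR_VERSION`; the fallback values, the `TOY_HAVE_OFFSET_TIFF_READ` macro and the `TiffReadStrategy` function are made up for illustration so that the snippet compiles without Leptonica installed.

```cpp
#include <string>

// Stand-in values so the sketch compiles on its own. In a real build these
// are assumed to come from Leptonica's headers instead.
#ifndef LIBLEPT_MAJOR_VERSION
#define LIBLEPT_MAJOR_VERSION 1
#endif
#ifndef LIBLEPT_MINOR_VERSION
#define LIBLEPT_MINOR_VERSION 73
#endif

// Hypothetical feature macro: true for Leptonica 1.74 or newer.
#if LIBLEPT_MAJOR_VERSION > 1 || \
    (LIBLEPT_MAJOR_VERSION == 1 && LIBLEPT_MINOR_VERSION >= 74)
#define TOY_HAVE_OFFSET_TIFF_READ 1
#else
#define TOY_HAVE_OFFSET_TIFF_READ 0
#endif

// Reports which code path this build would take.
std::string TiffReadStrategy() {
#if TOY_HAVE_OFFSET_TIFF_READ
  return "offset cursor";  // linear reads with the newer interface
#else
  return "page number";  // older interface, quadratic seeks
#endif
}
```

With the stand-in values above the older path is selected; building against 1.74 headers would flip the switch without any source change.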

jbreiden · 2016-11-25T19:46:51Z

I will work with Dan next week to try to keep as much compatibility as possible.

DanBloomberg · 2016-11-25T20:03:53Z

Thanks, Jeff. I believe this is an unusual situation where a leptonica interface that tesseract uses has been changed. And I want to apologize for the trouble it has caused. There is a trivial change to textord/imagefind.cpp that fixes this: replace the last arg in pixGenHalftoneMask() by NULL. This will skip the debug output images in this function, but it should be acceptable, and later if someone really wants the extra few debug images we can add the code to save them.

jbreiden · 2016-11-25T20:10:27Z

Let’s talk before Leptonica 1.74 ships. There is a distribution headache if the existing, unmodified Tesseract 3.0.4 can't compile and run with Leptonica 1.74.

DanBloomberg · 2016-11-25T20:54:17Z

The alternative to updating the pixGenHalftoneMask() function in tesseract is to make a wrapper in leptonica for the existing tesseract function. This would simply call the new leptonica function with NULL for the last arg.

egorpugin · 2016-11-25T21:01:35Z

Yes, this seems to be ABI breakage. Both tesseract and leptonica do not use semver (X.Y.Z), but use (X.Y), so personally I'm confused. With semver, ABI breakage is only allowed when increasing the X number. So, e.g., leptonica should be versioned as 2.00 or whatever (2.0.0?).

DanBloomberg · 2016-11-25T22:31:48Z

Here is my unofficial take on the ABI shared object version numbers. Jeff knows about these details and can correct if I'm wrong.

With the Debian releases, we will need to increase the *shared object version number* (which is different from the leptonica release number) with any change in the ABI. The Debian leptonica *soversion* for 1.73 is 5.0.0, and for 1.74 I believe that we'll increase it to 6.0.0.

The meaning of these three digits is, for the shared object name *whatever.so.X.Y.Z*, we increment
- X if the ABI release is backwards incompatible
- Y if the ABI release is backwards compatible (with interface changes)
- Z if there are only internal changes (no change to the ABI)

So if I write the wrapper mentioned above, and that were the only change, then the ABI would be backwards compatible and we'd only need to increment to 5.1.0 for the next Debian release. (However, there have been other changes, including the removal of deprecated functions.)
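The bump rule can be written down as a small function. This is a sketch of the rule as described above, not of any Debian or Leptonica tooling; `SoVersion`, `AbiChange` and `Bump` are hypothetical names, and resetting the lower digits to zero on a bump is my assumption (it matches the 5.0.0 to 6.0.0 and 5.0.0 to 5.1.0 examples).

```cpp
#include <tuple>

// Kinds of change between two releases of a shared library.
enum class AbiChange {
  kIncompatible,               // backwards incompatible ABI change
  kCompatibleInterfaceChange,  // backwards compatible, interface changed
  kInternalOnly                // no change to the ABI
};

// The X.Y.Z in whatever.so.X.Y.Z.
struct SoVersion {
  int x;
  int y;
  int z;
  bool operator==(const SoVersion& other) const {
    return std::tie(x, y, z) == std::tie(other.x, other.y, other.z);
  }
};

// Returns the next soversion; lower digits reset to zero (an assumption).
SoVersion Bump(SoVersion v, AbiChange change) {
  switch (change) {
    case AbiChange::kIncompatible:
      return {v.x + 1, 0, 0};
    case AbiChange::kCompatibleInterfaceChange:
      return {v.x, v.y + 1, 0};
    case AbiChange::kInternalOnly:
      return {v.x, v.y, v.z + 1};
  }
  return v;  // unreachable for valid enum values
}
```

Starting from 5.0.0, adding only the wrapper gives 5.1.0, while removing deprecated functions gives 6.0.0.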

DanBloomberg · 2016-11-26T00:32:02Z

I've added the wrapper for pixGenHalftoneMask(), so the (not yet released 1.74) git master now should be compatible with tesseract 3.0.4. With this, tesseract can be changed at your convenience to use the new interface pixGenerateHalftoneMask() with NULL for the last arg, and both will compile. We'll keep the old version in leptonica until there is no further need to use it with tesseract.

egorpugin · 2016-11-26T10:23:27Z

It's ok now. master tess + lept work fine.

zdenop · 2016-12-23T08:42:07Z

@DanBloomberg: When will leptonica 1.74 be released?

DanBloomberg · 2016-12-23T19:13:15Z

Just released leptonica 1.74.0 on github :-)

egorpugin · 2016-12-23T19:18:17Z

Also added to cppan.
https://cppan.org/pvt.cppan.demo.danbloomberg.leptonica/versions

We could stick now to 1.74.0 to prevent possible abi breakages.

Omnipresent · 2017-09-17T22:30:29Z

I found this issue while searching for a bug I've been encountering: In a multi-page tiff file, the text is only being extracted from the last page when using Tesseract API [TessBaseAPIProcessPages] (#1138)

Leptonica has a method called pixaReadMultipageTiff , would that need to be used instead?

amitdo added the feature request label May 27, 2016

jbreiden mentioned this issue Jul 16, 2016

win32: Show TIFF warnings on console #367

Merged

jbreiden mentioned this issue Oct 4, 2016

Speckled Documents Create Psychological Case for Tesseract #431

Closed

amitdo mentioned this issue Dec 7, 2016

which versions of leptonica are being enforced; release version vs. SO version #540

Closed

zdenop closed this as completed in 11f2057 Dec 24, 2016

zdenop added a commit that referenced this issue Dec 26, 2016

Multi-page TIFF buffering is broken - fix #233

245eebd

jbreiden mentioned this issue Jan 3, 2017

TessBaseAPI::ProcessPagesMultipageTiff has obsolete OpenCL code #635

Closed

Shreeshrii mentioned this issue Mar 26, 2017

Tesseract not working fine with arabic #791

Closed

amitdo mentioned this issue Apr 23, 2017

3.05: Backports from master branch #835

Merged

otiai10 mentioned this issue Nov 5, 2018

Request for info: support for multi-page tiffs otiai10/gosseract#136

Open

stweil mentioned this issue Jul 4, 2019

fix read wrong tiff page. #2538

Merged

zdenop mentioned this issue Jul 5, 2019

Fix handling of single pages from multipage TIFF files (issue #2537) #2542

Merged

amitdo added the performance label May 14, 2020

amitdo added the leptonica label Mar 22, 2021

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021

Multi-page TIFF buffering is broken - fix tesseract-ocr#233

31048b9

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021

Multi-page TIFF buffering is broken - fix tesseract-ocr#233

d11f7a0

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021

Multi-page TIFF buffering is broken - fix tesseract-ocr#233

ecdb908

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021

Multi-page TIFF buffering is broken - fix tesseract-ocr#233

e174151
