Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-page TIFF buffering is broken #233

Closed
tfmorris opened this issue Feb 19, 2016 · 49 comments
Closed

Multi-page TIFF buffering is broken #233

tfmorris opened this issue Feb 19, 2016 · 49 comments

Comments

@tfmorris
Copy link
Contributor

The current multi-page TIFF handling is seriously sub-optimal. First, it unnecessarily reads the entire file into memory which can tie up many MB of memory unnecessarily when pages are being processed in a streaming fashion. Second, I'm pretty sure that accessing by page number causes the entire buffer to be parsed from the beginning every time incurring both processing cost and memory thrashing.

@jbreiden
Copy link
Contributor

Please be careful with changes. It is very important not to break streaming support.

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-do-streaming

@tfmorris
Copy link
Contributor Author

Whoever makes the changes will have to make sure that all the tests in the test suite pass, just like any other change to the code base.

Of course this wouldn't be necessary if, when streaming support was added, it was done with some recognition of the performance impact rather than just leaving comments like "To keep code simple we will also buffer data coming from a file." https://github.com/tesseract-ocr/tesseract/blob/master/api/baseapi.cpp#L1119

@amitdo
Copy link
Collaborator

amitdo commented Feb 19, 2016

@jbreiden
Copy link
Contributor

I plead guilty on all counts. I wrote the streaming feature, I failed to write a test, I introduced the performance regression, I wrote the comment, and either I didn't notice the performance regression or failed to properly consider it.

I will attempt to write a test for streaming, and will work with you on TIFF. TIFF is historically tricky for two reasons. One is the duplicated functionality on the OpenCL path. Hard to synchronize and hard to test. I personally can't seem to run the OpenCL path at all without a segfault. Second, under win32 is it hard or impossible to pass a file descriptor between different DLLs. Which is awkward because the libtiff API prefers to work with file descriptors instead of file pointers.

@tfmorris
Copy link
Contributor Author

@jbreiden I don't see any particular reason that your feature should be the first to have a test. As far as I can tell, there are no tests at all for the entire program. Can you point to a more complete description of the file descriptor issue? Does TIFFClientOpen help at all?

Some random, possibly relevant, tidbits:

  • "TIFF is not a streamable format" - http://www.awaresystems.be/imaging/tiff/faq.html#q7
  • the "streaming" support doesn't stream - at least for anything other than a list of file names. All other formats are read entirely into memory before any processing starts. That's a pretty narrow definition of streaming.
  • file format sniffing requires 12 bytes, no more
  • multi-page tiff is on an entirely separate code path, which may make it easier to handle differently

@jbreiden
Copy link
Contributor

I was intimately familiar with the topic a decade ago. I have
tried my best to repress the memories.

http://www.asmail.be/msg0054669449.html

It's nice that the filename is already being passed around;
if we're lucky, maybe we can get what you need without any
API changes. Just out of curiosity, what are you using
multipage TIFF for? I usually think fax images, which are
relatively tiny.

You are right, the streaming feature is super narrow. But
it is important for book digitization.

@jbreiden
Copy link
Contributor

@tfmorris How about something like this? If you are happy with it, I'll hand this change over to Ray for review and eventual inclusion. No API changes. I think functionally it is exactly the same, except that a multipage TIFF from a file does not get buffered.

--- api/baseapi.cpp 2016-03-11 14:29:36.000000000 -0800
+++ api/baseapi.cpp 2016-03-28 15:49:06.000000000 -0700
@@ -1034,11 +1034,14 @@
       page = tessedit_page_number;
 #ifdef USE_OPENCL
     if ( od.selectedDeviceIsOpenCL() ) {
-      // FIXME(jbreiden) Not implemented.
-      pix = od.pixReadMemTiffCl(data, size, page);
+      pix = (data) ?
+          od.pixReadMemTiffCl(data, size, page) :
+          od.pixReadTiffCl(filename, page);
     } else {
 #endif
-      pix = pixReadMemTiff(data, size, page);
+      pix = (data) ?
+          pixReadMemTiff(data, size, page) :
+          pixReadTiff(filename, page);
 #ifdef USE_OPENCL
     }
 #endif
@@ -1086,8 +1089,7 @@
 // makes automatic detection of datatype (TIFF? filelist? PNG?)
 // impractical.  So we support a command line flag to explicitly
 // identify the scenario that really matters: filelists on
-// stdin. We'll still do our best if the user likes pipes.  That means
-// piling up any data coming into stdin into a memory buffer.
+// stdin. We'll still do our best if the user likes pipes.
 bool TessBaseAPI::ProcessPagesInternal(const char* filename,
                                        const char* retry_config,
                                        int timeout_millisec,
@@ -1109,31 +1111,24 @@
   }

   // At this point we are officially in autodection territory.
-  // That means we are going to buffer stdin so that it is
-  // seekable. To keep code simple we will also buffer data
-  // coming from a file.
+  // That means any data in stdin must be buffered, to make it
+  // seekable.
   std::string buf;
+  const l_uint8 *data = NULL;
   if (stdInput) {
     buf.assign((std::istreambuf_iterator<char>(std::cin)),
                (std::istreambuf_iterator<char>()));
-  } else {
-    std::ifstream ifs(filename, std::ios::binary);
-    if (ifs) {
-      buf.assign((std::istreambuf_iterator<char>(ifs)),
-                 (std::istreambuf_iterator<char>()));
-    } else {
-      tprintf("ERROR: Can not open input file %s\n", filename);
-      return false;
-    }
+    data = reinterpret_cast<const l_uint8 *>(buf.data());
   }

   // Here is our autodetection
   int format;
-  const l_uint8 * data = reinterpret_cast<const l_uint8 *>(buf.c_str());
-  findFileFormatBuffer(data, &format);
+  int r = (stdInput) ?
+      findFileFormatBuffer(data, &format) :
+      findFileFormat(filename, &format);

   // Maybe we have a filelist
-  if (format == IFF_UNKNOWN) {
+  if (r != 0 || format == IFF_UNKNOWN) {
     STRING s(buf.c_str());
     return ProcessPagesFileList(NULL, &s, retry_config,
                                 timeout_millisec, renderer,
@@ -1149,7 +1144,7 @@
   // Fail early if we can, before producing any output
   Pix *pix = NULL;
   if (!tiff) {
-    pix = pixReadMem(data, buf.size());
+    pix = (stdInput) ? pixReadMem(data, buf.size()) : pixRead(filename);
     if (pix == NULL) {
       return false;
     }
@@ -1162,16 +1157,15 @@
   }

   // Produce output
-  bool r = false;
-  if (tiff) {
-    r = ProcessPagesMultipageTiff(data, buf.size(), filename, retry_config,
-                                  timeout_millisec, renderer,
-                                  tesseract_->tessedit_page_number);
-  } else {
-    r = ProcessPage(pix, 0, filename, retry_config,
-                    timeout_millisec, renderer);
-    pixDestroy(&pix);
-  }
+  r = (tiff) ?
+      ProcessPagesMultipageTiff(data, buf.size(), filename, retry_config,
+                                timeout_millisec, renderer,
+                                tesseract_->tessedit_page_number) :
+      ProcessPage(pix, 0, filename, retry_config,
+                  timeout_millisec, renderer);
+
+  // Clean up memory as needed
+  pixDestroy(&pix);

   // End the output
   if (!r || (renderer && !renderer->EndDocument())) {

@tfmorris
Copy link
Contributor Author

I didn't mean to complain about this without offering a solution. I took a crack at this here: https://github.com/tfmorris/tesseract/tree/tiff-streaming
but it turned into a bit of a yak shaving exercise, so I dropped it. The main roadblock was that the "right" solution is to implement more reasonable support in Leptonica which exposes more of libTIFFs underlying capabilities. libTIFF knows how to do efficient access, it just gets lost on the way up through the layers.

Some other notes (mostly from memory, so take with a grain of salt):

  • file sniffing only needs/uses 12 bytes, so not very much needs to be buffered
  • file sniffing only returns top level TIFF container format, so all the other TIFF_foo checks can be removed
  • libTIFF is positioned on the next directory after an image is read an knows how to read it directly, but it's Leptonica's readImage(N+1), that forces a rewind, then read 0, read 1, read 2, ..., read N, read N+1.

I'll review your proposal in more detail tomorrow to see how it compares with what I started implementing.

@DanBloomberg
Copy link

Hello TMorris,

Jeff just told me about this thread. And thank you for pointing out the unsatisfactory condition of the multi-tiff read function.

The leptonica buck stops with me, and I made a small change that brings it down to linear (not quadratic). I will have it up on github later today.

-- Dan

@jbreiden
Copy link
Contributor

The patch from me above only buffers images coming from stdin. We could try to optimize this more and limit that buffer to 12 bytes, but it is not obvious how useful that is. I'm guessing it isn't worth even a little extra complexity.

@DanBloomberg
Copy link

Looking at my "fix" again, I believe it does NOT succeed in making the read time for N pages linear in N. The problem is that TIFFSetDirectory() always starts at directory 0 in the search. The way to fix this is to use the lower-level functions that TIFFSetDirectory() uses to walk through the directories, grabbing the image at each directory. I'll attempt to remedy this in the next day or so.

@tfmorris
Copy link
Contributor Author

@DanBloomberg Thanks for looking at this. Your most recent comment matches my (slightly vague) memory of how I thought things worked.

@jbreiden I'm not really in a position to make value judgements about what's worthwhile and what's not since I'm not familiar with the user base. I think my general plan of attack to keep things clean was to refactor to use the Leptonica stream functions (e.g. pixReadStreamTiff rather than pixReadTiff and pixReadMemTiff), but that would depend on Leptonica being smart enough to handle the page N -> page N+1 case without seeking (or introducing a new pure streaming API).

@DanBloomberg
Copy link

OK, it's all properly linearized. No refactoring, no low level functions, no api change, no static vars required.

See github.com/danbloomberg/leptonica.

@DanBloomberg
Copy link

@tfmorris

To implement this properly in tesseract, where you only want one image in memory at the same time, use the same approach that I just did in pixReadMultipageTiff():

  • get a FILE stream
  • use the FILE stream to get a TIFF stream
  • loop:
    • read the pix from the TIFF stream
    • do the OCR
    • call TIFFReadDirectory() to advance to the next image

The last function is a naked tiff library call. Jeff says that currently all the tiff reading functions are leptonica calls.

@tfmorris
Copy link
Contributor Author

@DanBloomberg Thanks for the outline (and for the new functionality!) Do you seeing any risk in going around Leptonica to libTIFF or do you consider this to be stable enough to be a non-issue?

@jbreiden I'm happy to take another crack at this or leave you to it. Let me know.

@jbreiden
Copy link
Contributor

There are two separable things under discussion here. The first is the unnecessary buffering. By the way, I checked and file sniffing does indeed return things like IFF_TIFF_G4. I think it makes sense to use my patch. @theraysmith has reviewed it, taken ownership, and will submit it to github.

The other topic is TIFF performance. I don't think I want to tackle that one. It's not technically difficult to write a couple libtiff calls to fix the performance problem. However, this would give Tesseract a new, direct build dependency on libtiff. That seems significant enough to warrant discussion on the development mailing list. I don't know how much of an obstacle that would be for users who build from source; it seems that many already struggle with dependencies.

I'll talk with Dan about whether there are any other options that make sense.

@DanBloomberg
Copy link

Jeff and I agree that you need a direct dependency on the TIFF data
structure in the tiff library to use the linear method.

As for stability, tiff lib has been extremely stable for 20 years or so --
I wouldn't worry about that.

On Wed, Mar 30, 2016 at 12:04 PM, jbreiden notifications@github.com wrote:

There are two separable things under discussion here. The first is the
unnecessary buffering. By the way, I checked and file sniffing does indeed
return things like IFF_TIFF_G4. I think it makes sense to use my patch.
@theraysmith https://github.com/theraysmith has reviewed it, taken
ownership, and will submit it to github.

The other topic is TIFF performance. I don't think I want to tackle that
one. It's not technically difficult to write a couple libtiff calls to fix
the performance problem. However, this would give Tesseract a new, direct
build dependency on libtiff. That seems significant enough to warrant
discussion on the development mailing list. I don't know how much of an
obstacle that would be for users who build from source; it seems that many
already struggle with dependencies.

I'll talk with Dan about whether there are any other options that make
sense.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#233 (comment)

@tfmorris
Copy link
Contributor Author

tfmorris commented Mar 30, 2016 via email

@DanBloomberg
Copy link

Tom, Ray hasn't weighed in yet. I believe the question comes down to (1) a
comparison between the amount of time to seek to an image in a tiff file
with many images, vs. the time to OCR that image and (2) the "cost" of
having tesseract depend explicitly on the TIFF library (i.e., using TIFF
data structures and library calls directly).

I don't know either of these two things. Do you have a timing for a seek
of hundreds of images in a large multipage tiff file?

-- Dan

On Wed, Mar 30, 2016 at 3:28 PM, Tom Morris notifications@github.com
wrote:

If Ray & Jeff have agreed the correct course of action, I'll defer.


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub
#233 (comment)

@jbreiden
Copy link
Contributor

jbreiden commented Apr 1, 2016

Ray is kind of busy so he may be slow to submit my buffering patch to github. It is okay for someone else to submit if time is of the essence.

Regarding the TIFF performance issue, I see it in two places. So far we've been discussing TessBaseAPI::ProcessPagesMultipageTiff but we have the exact some problem in MasterTrainer::LoadPageImages

for (page = 0; (pix = pixReadTiff(filename, page)) != NULL; ++page) {

@jbreiden
Copy link
Contributor

jbreiden commented Jul 18, 2016

Getting back to this now that I (might?) have clearance to make direct libtiff calls.
First, I see that Ray did not merge my patch above into the github repo. We need to
get that done before I fix the speed problem with multipage TIFF

@jbreiden
Copy link
Contributor

jbreiden commented Jul 18, 2016

Proof of concept for a future Leptonica interface that would get us down to linear seeks. This evolved from the conversation in bug #367

#include <stdio.h>
#include <tiffio.h>

const char *testfile = "test.tiff";

size_t PrimeThePump() {
  TIFF *tiff = TIFFOpen(testfile, "r");
  TIFFSetDirectory(tiff, 0);
  size_t offset = TIFFCurrentDirOffset(tiff);
  TIFFClose(tiff);
  return offset;
}

size_t ThankYouSirMayIHaveAnother(size_t offset) {
  TIFF *tiff = TIFFOpen(testfile, "r");
  TIFFSetSubDirectory(tiff, offset);
  TIFFReadDirectory(tiff);
  offset = TIFFCurrentDirOffset(tiff);
  TIFFClose(tiff);
  return offset;
}

int main(void) {
  size_t offset = PrimeThePump();
  while (offset = ThankYouSirMayIHaveAnother(offset)) {
    printf("offset=%lu\n", offset);
  }
}

@tfmorris
Copy link
Contributor Author

It's been a while since I looked at it (and don't have time to recheck now), but that looks about like what I remember thinking would be good.

DanBloomberg added a commit to DanBloomberg/leptonica that referenced this issue Sep 10, 2016
* This now resolves longstanding need for linear performance when
  reading multi-image TIFF files.  For example, tesseract should be able
  to store a million small images in a file and extract them efficiently.
  See, e.g., tesseract-ocr/tesseract#233
  Thanks to Jeff Breidenbach for figuring out how to do this in a
  general way without exposing TIFF internals to the client.
@jbreiden
Copy link
Contributor

jbreiden commented Sep 13, 2016

This patch reduces multipage TIFF seeks from O(n^3) to O(n), but requires the not-yet-released Leptonica 1.74. The patch disables the OpenCL accelerated TIFF codec. I'm confident that I could make OpenCL path work with some effort, but it is hard for me to test and I don't know how active and important Tesseract + OpenCL is these days.

--- tesseract/api/baseapi.cpp   2016-05-24 15:32:21.000000000 -0700
+++ tesseract/api/baseapi.cpp   2016-09-13 09:21:41.000000000 -0700
@@ -1025,26 +1025,14 @@
                                             int tessedit_page_number) {
 #ifndef ANDROID_BUILD
   Pix *pix = NULL;
-#ifdef USE_OPENCL
-  OpenclDevice od;
-#endif
   int page = (tessedit_page_number >= 0) ? tessedit_page_number : 0;
+  size_t offset = 0;
   for (; ; ++page) {
     if (tessedit_page_number >= 0)
       page = tessedit_page_number;
-#ifdef USE_OPENCL
-    if ( od.selectedDeviceIsOpenCL() ) {
       pix = (data) ?
-          od.pixReadMemTiffCl(data, size, page) :
-          od.pixReadTiffCl(filename, page);
-    } else {
-#endif
-      pix = (data) ?
-          pixReadMemTiff(data, size, page) :
-          pixReadTiff(filename, page);
-#ifdef USE_OPENCL
-    }
-#endif
+          pixReadMemFromMultipageTiff(data, size, &offset) :
+          pixReadFromMultipageTiff(filename, &offset);
     if (pix == NULL) break;
     tprintf("Page %d\n", page + 1);
     char page_str[kMaxIntSize];
@@ -1055,6 +1043,7 @@
     pixDestroy(&pix);
     if (!r) return false;
     if (tessedit_page_number >= 0) break;
+    if (!offset) break;
   }
   return true;
 #else
--- tesseract/classify/mastertrainer.cpp    2016-05-18 14:18:32.000000000 -0700
+++ tesseract/classify/mastertrainer.cpp    2016-09-13 09:30:11.000000000 -0700
@@ -214,10 +214,14 @@
 // Must be called after ReadTrainingSamples, as the current number of images
 // is used as an offset for page numbers in the samples.
 void MasterTrainer::LoadPageImages(const char* filename) {
+  size_t offset = 0;
   int page;
   Pix* pix;
-  for (page = 0; (pix = pixReadTiff(filename, page)) != NULL; ++page) {
+  for (page = 0; ; page++) {
+    pix = pixReadFromMultipageTiff(filename, &offset);
+    if (!pix) break;
     page_images_.push_back(pix);
+    if (!offset) break;
   }
   tprintf("Loaded %d page images from %s\n", page, filename);
 }

@zdenop
Copy link
Contributor

zdenop commented Sep 13, 2016

Then lets wait for new leptonica release. IMO it would be fine to fix OpenCL too - I got promises that somebody should have a look at opencl issues...

@zdenop
Copy link
Contributor

zdenop commented Oct 6, 2016

patch from 2016-03-29 committed as 54fafc4

@egorpugin
Copy link
Contributor

Future tess. release could use lept 1.74. It will be released soon.
We just need administrative decision: e.g. "Let's fixate on lept1.74 for the time being".

@amitdo
Copy link
Collaborator

amitdo commented Nov 25, 2016

@jbreiden
Is there a approximate date for final 4.0 release? Maybe just before the freeze of the next Debian stable or Ubuntu 17.04?

@stweil
Copy link
Member

stweil commented Nov 25, 2016

It will be a significant problem to change Leptonica and Tesseract simultaneously [...]

Isn't it possible to write code which supports both old and new (> 1.73) Leptonica (using conditional compilation)? I'd prefer such a solution, at least until Leptonica 1.73 or older is no longer used in current distributions.

@jbreiden
Copy link
Contributor

jbreiden commented Nov 25, 2016 via email

@DanBloomberg
Copy link

DanBloomberg commented Nov 25, 2016 via email

@jbreiden
Copy link
Contributor

Let’s talk before Leptonica 1.74 ships. There is a distribution headache if the existing, unmodified Tesseract 3.0.4 can't compile and run with Leptonica 1.74.

@DanBloomberg
Copy link

DanBloomberg commented Nov 25, 2016 via email

@egorpugin
Copy link
Contributor

egorpugin commented Nov 25, 2016

Yes, this seems to be ABI breakage. Both tesseract and leptonica do not use semver (X.Y.Z), but use (X.Y), so personally I'm confused. With semver ABI breakage only allowed when increasing X number. So, e.g. leptonica should be versioned as 2.00 or whatever (2.0.0?).

@DanBloomberg
Copy link

DanBloomberg commented Nov 25, 2016 via email

@DanBloomberg
Copy link

DanBloomberg commented Nov 26, 2016 via email

@egorpugin
Copy link
Contributor

It's ok now. master tess + lept work fine.

@zdenop
Copy link
Contributor

zdenop commented Dec 23, 2016

@DanBloomberg: When leptonica 1.74 will be released?

@DanBloomberg
Copy link

Just released leptonica 1.74.0 on github :-)

@egorpugin
Copy link
Contributor

Also added to cppan.
https://cppan.org/pvt.cppan.demo.danbloomberg.leptonica/versions

We could stick now to 1.74.0 to prevent possible abi breakages.

@Omnipresent
Copy link

Omnipresent commented Sep 17, 2017

I found this issue while searching for a bug I've been encountering: In a multi-page tiff file, the text is only being extracted from the last page when using Tesseract API [TessBaseAPIProcessPages] (#1138)

Leptonica has a method called pixaReadMultipageTiff , would that need to be used instead?

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this issue Mar 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants