Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

win32: Show TIFF warnings on console #367

Merged
merged 2 commits into from
Jul 17, 2016
Merged

Conversation

stweil
Copy link
Member

@stweil stweil commented Jul 16, 2016

Showing them in a window (default) is not acceptable for a console
application like Tesseract which must be able to work in batch mode.

Signed-off-by: Stefan Weil sw@weilnetz.de

Showing them in a window (default) is not acceptable for a console
application like Tesseract which must be able to work in batch mode.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@egorpugin
Copy link
Contributor

TIFF gui warnings probably should be turned off during tiff library compilation with definition TIF_PLATFORM_CONSOLE.

@stweil
Copy link
Member Author

stweil commented Jul 16, 2016

All precompiled TIFF libraries for Windows which I know (e .g. the mingw64-i686-tiff which is part of cygwin) don't define TIF_PLATFORM_CONSOLE.

I don't think that it would be a good idea to require using a special TIFF library for Tesseract.

So yes, in theory your suggestion is an alternative solution, but it is not feasible in practice.

@egorpugin
Copy link
Contributor

Your commit introduces dependency on tiff library for tesseract.
Now by default tesseract uses only leptonica library.

@jbreiden
Copy link
Contributor

I think there is another bug about better memory usage with multipage tiff, where the solution is also a tiff library dependency. Will try to find it.

@jbreiden
Copy link
Contributor

Yes, it was bug #223. There is an enhancement request to improve performance on multipage tiff reads. It is currently slow to read the latter images in multipage tiff. The only way to accomplish this is with a direct libtiff dependency. I didn't implement because I wasn't sure if it was worth it.

@jbreiden
Copy link
Contributor

I mean bug #233

The previous commit added a dependency on tiffio.h, so enable the new
code only if that file is available.

The code which conditionally defines HAVE_TIFFIO_H was already there
although that macro was unused up to now.

Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Copy link
Member Author

stweil commented Jul 17, 2016

Some additional notes on the problem which is addressed by this PR:

It looks like many TIFF files (nearly all?) include some vendor specific data which trigger warnings from libtiff, so it is a real and very common problem for Windows users who want to do batch processing. I see these alternatives to handle the problem:

  1. Change libtiff. Instead of defaulting to a message window for all warnings, it could do so for GUI applications but use stderr for console applications.
  2. Change leptonica. It could offer an interface for setting the handler for warnings, and tesseract could use that interface to set the handler. Or leptonica could set a reasonable default for GUI / console applications.
  3. Change tesseract. That's what I did in my PR.

IMHO in the long run the first alternative would be the best solution, but i see no chance to get it quickly.

@stweil
Copy link
Member Author

stweil commented Jul 17, 2016

@egorpugin, I added a commit which handles the libtiff dependency, so the code will not break if tiffio.h is unavailable.

@egorpugin
Copy link
Contributor

Yes, but the code won't be active almost always. Users should care about tiff library, download it, deploy, link to 'tesseractmain' or their own binary.
Also what is HAVE_TIFFIO_H? It is not defined in tesseract build system, so it will be turned off.

Better solution I provided when building tesseract with CPPAN.
See:
#209 (bottom comments)
https://github.com/tesseract-ocr/tesseract/wiki/Compiling#master-branch-305-and-later
and
https://github.com/DanBloomberg/leptonica/blob/master/cppan.yml#L62 this line will disable gui notifications from tiff library completely.

@stweil
Copy link
Member Author

stweil commented Jul 17, 2016

HAVE_TIFFIO_H is set by the current tesseract code – both by configure and by cmake (I tested both variants before adding the last commit).

@egorpugin
Copy link
Contributor

Ah, I see now, sorry.
Well, it's ok then.

@zdenop zdenop merged commit d5b7f68 into tesseract-ocr:master Jul 17, 2016
@stweil stweil deleted the windows branch July 17, 2016 12:12
@jbreiden
Copy link
Contributor

Wait a second. Does this mean it is okay for me to use direct calls to libtiff if I
put it inside HAVE_TIFFIO_H? If so, I can go make the multipage TIFF reader
vastly faster.

@zdenop
Copy link
Contributor

zdenop commented Jul 18, 2016

yes

On 18 Jul 2016 19:23, "jbreiden" notifications@github.com wrote:

Wait a second. Does this mean it is okay for me to use direct calls to
libtiff if I
put it inside HAVE_TIFFIO_H? If so, I can go make the multipage TIFF reader
vastly faster.


You are receiving this because you modified the open/close state.
Reply to this email directly, view it on GitHub
#367 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAjCzNcT2CXvmIoUIyBJdt-DlI4Ur6ueks5qW7aNgaJpZM4JN-Iw
.

@egorpugin
Copy link
Contributor

My vote for removing HAVE_TIFFIO_H (and removing this commit).
It is not used anywhere in tesseract except this commit.
Tesseract uses leptonica API, so maybe it's better to contribute there and use its API later.

@egorpugin
Copy link
Contributor

It seems HAVE_TIFFIO_H was a legacy check, that was automatically imported to cmake to save compatibility with autotools build. Now I see that it's not used.

@jbreiden
Copy link
Contributor

jbreiden commented Jul 18, 2016

Well, this is a core decision. If and only if Tesseract allows a direct TIFF
dependency, then I am going to use it to significantly speed up OCR of
multipage TIFF as part of bug #233.

@egorpugin
Copy link
Contributor

egorpugin commented Jul 18, 2016

Is it possible via leptonica API? I see some comments from Dan B. there.

Upd.: Ahh, Dan said that you really need direct libtiff dependency. No more questions from me.

@jbreiden
Copy link
Contributor

Yeah, Dan and I talked about this extensively and eventually came to agreement. It does not seem possible through Leptonica, even if we modify Leptonica. The issue is there is a libtiff data structure that must be held in memory, and must be iterated through to get successive images in multipage TIFF. Leptonica doesn't have the ability to keep this data structure around, because Leptonica is functional C program. So you can't stash it as a member variable as a Leptonica class or something like that. And we're sure as heck not going to put it in a global because that isn't threadsafe. Which means that, currently, Leptonica has to start over from the beginning every time. When someone asks image 2348 in a multipage TIFF, we have to start at the beginning and do 2348 seeks to get to data.

@jbreiden
Copy link
Contributor

jbreiden commented Jul 18, 2016

I suppose it is possible we are wrong. I'll look one last time, starting here. Sorry for hijacking this bug.

http://www.libtiff.org/libtiff.html#Dirs
http://www.asmail.be/msg0054682091.html

@zdenop
Copy link
Contributor

zdenop commented Jul 18, 2016

@egorpugin: libtiff is needed for tesseract opencl implementation. So you need to remove much more than this commit.

@egorpugin
Copy link
Contributor

It's strange a bit. Does OpenCL really need this? I.e. leptonica can not provide required interfaces? Or opencl is only implemented for tiff images?

Also with CPPAN build I mentioned above tiff dependency can be added in the simplest possible way (add one line to cppan.yml).


I'm trying to understand what do you need to get custom pages (images, directories) from tiff image. Looks like it's TIFFSetDirectory(TIFF *tif, int dir_n) [1,2] call. It reads image file from the beginning, iterates over each directory until requested dir number and reads it (TIFFReadDirectory(tif) [3]). Then it can be read by leptonica with pixReadFromTiffStream(tif) like here [4]. I do not understand why leptonica has one more cycle here [5]. It gives us O(N^2) complexity instead O(N), isn't it?

So, for example if we want to read custom dirs (images) from tiff via leptonica, the complexity is O(N). But if we want to read tiff sequentially we can add a new function to lept. For example, pixReadStreamTiffNext() which will iterate to the next dir (image) in the tiff stream via one TIFFReadDirectory() call because it increases tif->tif_nextdiroff offset.

Correct me if I'm wrong somewhere.

[1] http://www.libtiff.org/man/TIFFSetDirectory.3t.html
[2] https://github.com/vadz/libtiff/blob/master/libtiff/tif_dir.c#L1531
[3] http://www.libtiff.org/man/TIFFReadDirectory.3t.html
[4] https://github.com/DanBloomberg/leptonica/blob/master/src/tiffio.c#L375
[5] https://github.com/DanBloomberg/leptonica/blob/master/src/tiffio.c#L409

@jbreiden
Copy link
Contributor

jbreiden commented Jul 18, 2016

The OpenCl code (amongst other things) provides hardware accelerated alternatives to a very small subset of Leptonica calls.

You are right, Leptonica's pixReadStreamTiff() is adding lots of unnecessary overhead. The only reason we didn't notice it was all this stuff gets cached in memory, hence the massive amount of seeking is a little bit hidden. I already have a patch out to Dan to call TIFFSetDirectory() only once per image. However, even a single call to TIFFSetDirectory() is expensive.

Your suggestion about calling TIFFReadDirectory(TIFF *tif) to get down to linear will not work unless someone is able to hold onto the TIFF *tif struct. Tesseract can do this, but only if it is allowed to directly link to libtiff. Leptonica cannot do this, because it has nowhere persistent to hold it. Everything is completely reset for every image read, including the call to TIFFOpen. All that said, I now see that we can potentially work around this by having Tesseract keep track of file offsets. Need to think this over.

#include <stdio.h>
#include <tiffio.h>

const char *testfile = "test.tiff";

size_t PrimeThePump() {
  TIFF *tiff = TIFFOpen(testfile, "r");
  TIFFSetDirectory(tiff, 0);
  size_t offset = TIFFCurrentDirOffset(tiff);
  TIFFClose(tiff);
  return offset;
}

size_t ThankYouSirMayIHaveAnother(size_t offset) {
  TIFF *tiff = TIFFOpen(testfile, "r");
  TIFFSetSubDirectory(tiff, offset);
  TIFFReadDirectory(tiff);
  offset = TIFFCurrentDirOffset(tiff);
  TIFFClose(tiff);
  return offset;
}

int main(void) {
  size_t offset = PrimeThePump();
  while (offset = ThankYouSirMayIHaveAnother(offset)) {
    printf("offset=%lu\n", offset);
  }
}

@stweil
Copy link
Member Author

stweil commented Jul 18, 2016

Alternative 1 (see my comments from yesterday) is addressed by this libtiff issue: http://bugzilla.maptools.org/show_bug.cgi?id=2571.

I don't expect that a libtiff version with that modification will be available soon.

@egorpugin
Copy link
Contributor

egorpugin commented Jul 18, 2016

So I see the following possible solution. We need to query all offsets in tiff file atleast once. We could do this in leptonica. Also we can add a function that will read tiff image by dir offset.

Code flow:

// tesseract side
FILE *fp = fopen(filename, "rb");
int n_dirs = 0;

// call to leptonica
// fp is opened, leptonica will call fast? TIFFClientOpen()
uint64_t *offsets = pixReadTiffOffsets(fp, &n_dirs);

// read interesting image O(1)
// fp is opened, leptonica will call fast? tif = TIFFClientOpen()
// TIFFSetSubDirectory(tif, offset)
// TIFFReadDirectory(tif)
// pixReadFromTiffStream()
PIX *image = pixReadTiffDir(fp, offsets[interesting_image_number]);

// free() offsets as C does not have C++ delete[]
free(offsets);

What do you think?

@jbreiden
Copy link
Contributor

jbreiden commented Jul 18, 2016

I dunno; we almost always read all the images in a loop, so putting functions in Leptonica somewhat similar to my example would also do the trick. This is ultimately Dan's decision. But also consider that it's kind of a logistics pain to put something new in Leptonica and then make Tesseract depend on it. So for Tesseract, we may be better off doing the direct libtiff calls.

[EDIT: Took a look at this. Impossible to keep everything on Tesseract's side due to pixReadFromTiffStream() is not exposed]

@amitdo
Copy link
Collaborator

amitdo commented May 12, 2017

This is the only place we use libtiff directly now.

Can we move it to Leptonica?

CC: @DanBloomberg

@amitdo
Copy link
Collaborator

amitdo commented May 12, 2017

stweil commented on Jul 17, 2016

Change leptonica. It could offer an interface for setting the handler for warnings, and tesseract could use that interface to set the handler. Or leptonica could set a reasonable default for GUI / console applications.

@stweil
Copy link
Member Author

stweil commented May 12, 2017

Dan recently added a handler which suppresses TIFF warnings completely. So we have to wait for a new Leptonica release.

@amitdo
Copy link
Collaborator

amitdo commented May 22, 2017

1.72.2 is here, so we may able to drop this code soon.

We need to wait to an update of Mingw-w64's Leptonica PKGBUILD.

Egor will probably update cppan very soon.

@egorpugin
Copy link
Contributor

Added 1..74.2, tess will automatically use it now.

zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021
win32: Show TIFF warnings on console
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021
win32: Show TIFF warnings on console
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021
win32: Show TIFF warnings on console
zvezdochiot pushed a commit to ImageProcessing-ElectronicPublications/tesseract that referenced this pull request Mar 28, 2021
win32: Show TIFF warnings on console
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants