-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
win32: Show TIFF warnings on console #367
Conversation
Showing them in a window (default) is not acceptable for a console application like Tesseract which must be able to work in batch mode. Signed-off-by: Stefan Weil <sw@weilnetz.de>
TIFF gui warnings probably should be turned off during tiff library compilation with definition TIF_PLATFORM_CONSOLE. |
All precompiled TIFF libraries for Windows which I know (e .g. the mingw64-i686-tiff which is part of cygwin) don't define TIF_PLATFORM_CONSOLE. I don't think that it would be a good idea to require using a special TIFF library for Tesseract. So yes, in theory your suggestion is an alternative solution, but it is not feasible in practice. |
Your commit introduces dependency on tiff library for tesseract. |
I think there is another bug about better memory usage with multipage tiff, where the solution is also a tiff library dependency. Will try to find it. |
Yes, it was bug #223. There is an enhancement request to improve performance on multipage tiff reads. It is currently slow to read the latter images in multipage tiff. The only way to accomplish this is with a direct libtiff dependency. I didn't implement because I wasn't sure if it was worth it. |
I mean bug #233 |
The previous commit added a dependency on tiffio.h, so enable the new code only if that file is available. The code which conditionally defines HAVE_TIFFIO_H was already there although that macro was unused up to now. Signed-off-by: Stefan Weil <sw@weilnetz.de>
Some additional notes on the problem which is addressed by this PR: It looks like many TIFF files (nearly all?) include some vendor specific data which trigger warnings from libtiff, so it is a real and very common problem for Windows users who want to do batch processing. I see these alternatives to handle the problem:
IMHO in the long run the first alternative would be the best solution, but i see no chance to get it quickly. |
@egorpugin, I added a commit which handles the libtiff dependency, so the code will not break if tiffio.h is unavailable. |
Yes, but the code won't be active almost always. Users should care about tiff library, download it, deploy, link to 'tesseractmain' or their own binary. Better solution I provided when building tesseract with CPPAN. |
HAVE_TIFFIO_H is set by the current tesseract code – both by configure and by cmake (I tested both variants before adding the last commit). |
Ah, I see now, sorry. |
Wait a second. Does this mean it is okay for me to use direct calls to libtiff if I |
yes On 18 Jul 2016 19:23, "jbreiden" notifications@github.com wrote:
|
My vote for removing HAVE_TIFFIO_H (and removing this commit). |
It seems HAVE_TIFFIO_H was a legacy check, that was automatically imported to cmake to save compatibility with autotools build. Now I see that it's not used. |
Well, this is a core decision. If and only if Tesseract allows a direct TIFF |
Is it possible via leptonica API? I see some comments from Dan B. there. Upd.: Ahh, Dan said that you really need direct libtiff dependency. No more questions from me. |
Yeah, Dan and I talked about this extensively and eventually came to agreement. It does not seem possible through Leptonica, even if we modify Leptonica. The issue is there is a libtiff data structure that must be held in memory, and must be iterated through to get successive images in multipage TIFF. Leptonica doesn't have the ability to keep this data structure around, because Leptonica is functional C program. So you can't stash it as a member variable as a Leptonica class or something like that. And we're sure as heck not going to put it in a global because that isn't threadsafe. Which means that, currently, Leptonica has to start over from the beginning every time. When someone asks image 2348 in a multipage TIFF, we have to start at the beginning and do 2348 seeks to get to data. |
I suppose it is possible we are wrong. I'll look one last time, starting here. Sorry for hijacking this bug. http://www.libtiff.org/libtiff.html#Dirs |
@egorpugin: libtiff is needed for tesseract opencl implementation. So you need to remove much more than this commit. |
It's strange a bit. Does OpenCL really need this? I.e. leptonica can not provide required interfaces? Or opencl is only implemented for tiff images? Also with CPPAN build I mentioned above tiff dependency can be added in the simplest possible way (add one line to cppan.yml). I'm trying to understand what do you need to get custom pages (images, directories) from tiff image. Looks like it's So, for example if we want to read custom dirs (images) from tiff via leptonica, the complexity is O(N). But if we want to read tiff sequentially we can add a new function to lept. For example, Correct me if I'm wrong somewhere. [1] http://www.libtiff.org/man/TIFFSetDirectory.3t.html |
The OpenCl code (amongst other things) provides hardware accelerated alternatives to a very small subset of Leptonica calls. You are right, Leptonica's pixReadStreamTiff() is adding lots of unnecessary overhead. The only reason we didn't notice it was all this stuff gets cached in memory, hence the massive amount of seeking is a little bit hidden. I already have a patch out to Dan to call TIFFSetDirectory() only once per image. However, even a single call to TIFFSetDirectory() is expensive. Your suggestion about calling TIFFReadDirectory(TIFF *tif) to get down to linear will not work unless someone is able to hold onto the TIFF *tif struct. Tesseract can do this, but only if it is allowed to directly link to libtiff. Leptonica cannot do this, because it has nowhere persistent to hold it. Everything is completely reset for every image read, including the call to TIFFOpen. All that said, I now see that we can potentially work around this by having Tesseract keep track of file offsets. Need to think this over. #include <stdio.h>
#include <tiffio.h>
const char *testfile = "test.tiff";
size_t PrimeThePump() {
TIFF *tiff = TIFFOpen(testfile, "r");
TIFFSetDirectory(tiff, 0);
size_t offset = TIFFCurrentDirOffset(tiff);
TIFFClose(tiff);
return offset;
}
size_t ThankYouSirMayIHaveAnother(size_t offset) {
TIFF *tiff = TIFFOpen(testfile, "r");
TIFFSetSubDirectory(tiff, offset);
TIFFReadDirectory(tiff);
offset = TIFFCurrentDirOffset(tiff);
TIFFClose(tiff);
return offset;
}
int main(void) {
size_t offset = PrimeThePump();
while (offset = ThankYouSirMayIHaveAnother(offset)) {
printf("offset=%lu\n", offset);
}
} |
Alternative 1 (see my comments from yesterday) is addressed by this libtiff issue: http://bugzilla.maptools.org/show_bug.cgi?id=2571. I don't expect that a libtiff version with that modification will be available soon. |
So I see the following possible solution. We need to query all offsets in tiff file atleast once. We could do this in leptonica. Also we can add a function that will read tiff image by dir offset. Code flow: // tesseract side
FILE *fp = fopen(filename, "rb");
int n_dirs = 0;
// call to leptonica
// fp is opened, leptonica will call fast? TIFFClientOpen()
uint64_t *offsets = pixReadTiffOffsets(fp, &n_dirs);
// read interesting image O(1)
// fp is opened, leptonica will call fast? tif = TIFFClientOpen()
// TIFFSetSubDirectory(tif, offset)
// TIFFReadDirectory(tif)
// pixReadFromTiffStream()
PIX *image = pixReadTiffDir(fp, offsets[interesting_image_number]);
// free() offsets as C does not have C++ delete[]
free(offsets); What do you think? |
I dunno; we almost always read all the images in a loop, so putting functions in Leptonica somewhat similar to my example would also do the trick. This is ultimately Dan's decision. But also consider that it's kind of a logistics pain to put something new in Leptonica and then make Tesseract depend on it. So for Tesseract, we may be better off doing the direct libtiff calls. [EDIT: Took a look at this. Impossible to keep everything on Tesseract's side due to pixReadFromTiffStream() is not exposed] |
This is the only place we use libtiff directly now. Can we move it to Leptonica? CC: @DanBloomberg |
stweil commented on Jul 17, 2016
|
Dan recently added a handler which suppresses TIFF warnings completely. So we have to wait for a new Leptonica release. |
1.72.2 is here, so we may able to drop this code soon. We need to wait to an update of Mingw-w64's Leptonica PKGBUILD. Egor will probably update cppan very soon. |
Added 1..74.2, tess will automatically use it now. |
win32: Show TIFF warnings on console
win32: Show TIFF warnings on console
win32: Show TIFF warnings on console
win32: Show TIFF warnings on console
Showing them in a window (default) is not acceptable for a console
application like Tesseract which must be able to work in batch mode.
Signed-off-by: Stefan Weil sw@weilnetz.de