Troubles with Tesseract/Leptonica presets #36
Comments
Great to hear that you guys are using JavaCPP! Thanks for the feedback. So, let's see if we can get this working properly.
Thanks for your support! It's great to see more and more people finding this project useful. |
Hello Samuel, I've finally managed to get the Tesseract preset working for me. I just want to answer your questions for the sake of completeness. Although the build system works as a whole, I've found several things that need to be improved.
It fails with java.lang.UnsatisfiedLinkError because two native libraries that Leptonica was linked against, libpng.so.16 and libjpeg.so.62, couldn't be found. I had to compile them from source to solve this issue because my fairly recent Ubuntu 14.04 still ships ancient versions of these libraries.
OK, I naturally expect the following code to work because there is a variant of pixReadMemTiff that accepts ByteBuffer as first argument:
This one works as expected:
Hmm, what I did manually was a PIX wrapper class accepting and passing opaque PIX object to and from Leptonica:
And yes, Leptonica's PIX was exposed as a part of Tesseract API. Is it possible to link jnitesseract to both tesseract and leptonica? This way only necessary methods will be wrapped and no dead code will be loaded into memory.
The fix works, thanks!
It does surprisingly appear as BoolPointer after the last fix 👍 ))
Finally, I just want to mention two issues that urgently need to be addressed:
Leptonica depends on several native libraries (libtiff, libgif, libpng, libjpeg, zlib etc.) in order to work properly. These dependencies should be present on the target system before the cppbuild.sh script is executed. Moreover, the installed dependencies should match Leptonica's configuration in "src/environ.h" (#define HAVE_???, for example #define HAVE_LIBTIFF 1). Otherwise, the build may fail in a subtle way. Unfortunately, the supplied cppbuild.sh script doesn't check whether all required dependencies are installed.
It's possible to build only selected presets. As for me, I did the following:
It failed because Tesseract depends on Leptonica but the latter wasn't present in the build directory. This one works though:
Tesseract's cppbuild.sh needs to be modified to process Leptonica first. It all looks like a limitation of the current build system (Bash/Maven). Are you still planning to replace it with Gradle? Best regards |
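For illustration only, here is a rough sketch of the first half of the dependency check asked for above: reading the HAVE_* switches out of the text of "src/environ.h". The class name EnvironFlags and the sample input are made up for this example, and the sketch does not check whether the corresponding libraries are actually installed; that part would still have to be added.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JavaCPP or Leptonica.
final class EnvironFlags {
    // Matches lines such as "#define  HAVE_LIBTIFF   1" at the start of a line.
    private static final Pattern DEFINE =
            Pattern.compile("^\\s*#\\s*define\\s+(HAVE_\\w+)\\s+(\\d+)", Pattern.MULTILINE);

    /** Maps each HAVE_* macro found in the header text to true when its value is non-zero. */
    static Map<String, Boolean> parse(String headerText) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        Matcher m = DEFINE.matcher(headerText);
        while (m.find()) {
            flags.put(m.group(1), Integer.parseInt(m.group(2)) != 0);
        }
        return flags;
    }

    public static void main(String[] args) {
        // Sample input invented for the example; a real run would read src/environ.h.
        String sample = "#define  HAVE_LIBTIFF   1\n#define  HAVE_LIBPNG    0\n";
        System.out.println(parse(sample));
    }
}
```

A build script could compare the resulting map against the libraries it finds on the system and stop early with a clear message instead of failing later in a subtle way.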
I've added As for About NIO buffers, currently only direct ones are supported. It would be convenient to support non-direct ones as well, but they would incur additional overhead, so I just have not made a priority out of them. Besides, as you found out, we can simply create a new
The build system is quite hackish, yes, but there is no precedent for this kind of tool. AFAIK, we're basically creating something that no one has ever attempted to do before! On any platform with languages such as Java, Python, Ruby, C#, JavaScript, etc, this is a first in history. At this point in time, Gradle seems like the most promising alternative, but someone needs to try and make it work. Would you yourself be interested in undertaking that challenge?
In any case, the build system isn't intended for end users. As long as the binary artifacts work with normal Maven builds on target platforms, whatever needs to happen for the native compilation phase with Bash, Gradle, etc should not matter. We could of course have custom build options that could, for example, create a merged Leptonica/Tesseract artifact, and I would be glad to reflect the changes in the source code, but the binary artifacts would not be uploaded to the Central Repository. Does that all make sense? |
Hello Samuel, thank you for your investigation and patches. Please refer to my inline comments below.
I didn't try your recent patch, but using BoolPointer has worked for me before. Regarding big-endian machines, I could give it a try: I still own a working iMac from 2005 equipped with a G5 PowerPC processor that I used for catching endianness-related bugs in FFmpeg and Tesseract. This test has low priority now because no one seems to use such machines at present.
Well, it depends on the produced overhead. We usually call
I'm not a Gradle expert but I'm prepared to give it a try. The biggest challenge would be the compilation of native libraries. But this is something we should discuss somewhere else (e-mail, chat, Skype, whatever). Feel free to contact me at - maximumspatium at googlemail dot com. Best regards |
The overhead to use anything non-direct from JNI is pretty much always 1) memory allocation on the native heap and 2) a data copy. That is what is happening right now, but it doesn't do it automatically for non-direct NIO buffers, that's all. The biggest hurdle that I see in adopting something other than Bash is to find a replacement for shell commands commonly used like I think the mailing list would be appropriate for discussion, but sure, private messages are fine too! Thanks BTW, if what you need urgently is a smaller JNI library, it would probably be easier to modify the existing |
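To make the two costs named above concrete, here is a minimal plain-JDK sketch of a manual conversion from a heap (non-direct) buffer to a direct one. The class and method names (DirectBufferCopy, toDirect) are invented for this example and are not JavaCPP API; this is not a claim about how JavaCPP handles it internally.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, shown only to illustrate allocation plus copy.
final class DirectBufferCopy {
    /** Returns a direct buffer holding the remaining bytes of src; src's position is left unchanged. */
    static ByteBuffer toDirect(ByteBuffer src) {
        if (src.isDirect()) {
            return src; // already backed by native memory, nothing to copy
        }
        // Cost 1: allocate memory outside the Java heap.
        ByteBuffer direct = ByteBuffer.allocateDirect(src.remaining());
        // Cost 2: copy the bytes; duplicate() keeps the caller's position intact.
        direct.put(src.duplicate());
        direct.flip(); // make the copied bytes readable from position 0
        return direct;
    }

    public static void main(String[] args) {
        // Four bytes of a little-endian TIFF header, standing in for an in-memory image.
        ByteBuffer heap = ByteBuffer.wrap(new byte[] {0x49, 0x49, 0x2A, 0x00});
        ByteBuffer direct = toDirect(heap);
        System.out.println(direct.isDirect() + " " + direct.remaining());
    }
}
```

For images constructed on the fly, the copy is proportional to the image size, so whether it matters depends on how often the call is made.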
What changes are necessary for non-direct NIO buffers to work?
Yes, it would be nice to have that. What needs to be modified?
Does the JavaCPP project have a mailing list somewhere, or do you mean issue comments? Thank you in advance! |
We'd need to add things here and there in
Like I said, to modify the native library files, we need to modify the The mailing list is here: https://groups.google.com/group/javacpp-project |
…ts non-direct ones backed by arrays (issue bytedeco/javacpp-presets#36)
As indicated above, I've added support for non-direct NIO buffers. Could you confirm that this change works well with your application? Thanks! |
Hello Samuel,
Thank you very much for the patch! I've recompiled both projects from source and tested with our app. The following line works now:
It does look much nicer to me. As to speed, I cannot notice any difference.
I have two further suggestions:
Leptonica comes with several additional programs in the prog subdirectory (regression tests, examples etc.). They usually aren't required for the library itself. IIUC, JavaCPP doesn't use them either. The following simple patch switches off compilation of these additional programs using the --disable-programs option. This speeds up Leptonica compilation a bit and consumes less disk space.
Does it make sense to bump the JavaCPP version, say to 0.11-SNAPSHOT? The current code is beyond the scope of the 0.10 release. Bumping the dependency version would make it easier to test new commits from the local Maven repository.
Best regards |
If you can make that patch available as a pull request, I'll merge it right away, thanks! One of the main issues I'm having with Leptonica though is the reflection API from the JDK slowing down to a crawl on large classes. If you figure out a way to work around that one, let me know, thanks! As for the version number, it's because I still find this system a bit inconvenient... I plan to bump it when I start making incompatible changes, pretty soon now ;) |
Hello Samuel,
I'm not quite sure we both mean the same thing, but I noticed that Leptonica's JNI library requires about 28 minutes to build while Tesseract's requires less than one minute. It's not quite clear to me why. Leptonica is written in C, so there is no notion of classes. Best regards |
It's precisely because Leptonica has no notion of class that we end up putting all the functions in one big class in Java:
So, for some reason, it looks like the JDK is having a hard time querying annotations on methods when those methods are in a class with a lot of other methods, and that is what would need to be investigated... |
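As a starting point for that investigation, a small plain-JDK harness along these lines could time annotation lookups on classes of different sizes. Everything here (AnnotationScan, the Marked annotation, the tiny Functions class) is a made-up stand-in; it only measures the lookup and does not work around the slowdown.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

// Hypothetical measuring harness, not part of JavaCPP.
final class AnnotationScan {
    /** Stand-in for the annotations queried on generated methods. */
    @Retention(RetentionPolicy.RUNTIME)
    @interface Marked {}

    /** Tiny stand-in for one big class holding all the functions of a C library. */
    static final class Functions {
        @Marked static int first() { return 1; }
        @Marked static int second() { return 2; }
        static int helper() { return 0; }
    }

    /** Counts the declared methods of cls that carry @Marked. */
    static int countAnnotated(Class<?> cls) {
        int count = 0;
        for (Method m : cls.getDeclaredMethods()) {
            if (m.isAnnotationPresent(Marked.class)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int n = countAnnotated(Functions.class);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println(n + " annotated methods scanned in " + micros + " us");
    }
}
```

Running the same loop against a generated class with thousands of methods, and against the same methods split over several smaller classes, would show whether the cost really grows faster than the method count.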
Most of the above has been fixed in the -0.11 release, so I'll close this issue. Thanks for reporting and testing everything! I think the only two remaining issues of interest are:
Let us discuss these two issues, or anything else I missed, in a new thread... Thanks! |
Hello,
first of all - thank you for bringing us this amazing project! We've been using JavaCPP in our music recognition engine (www.audiveris.org) since 2012 for communicating with Tesseract OCR. For this purpose I wrote an elegant interface by hand, based on JavaCPP. It has worked for us for several years until now.
For a couple of days I've tried to compile my old code with the current JavaCPP v0.10 and noticed that it doesn't compile anymore. While examining the new library code I noticed several big changes. The biggest one was the introduction of an automatic system that scans C++ headers and produces the appropriate Java interface.
I spent several days playing with the new JavaCPP library and its Tesseract presets in order to integrate it with my project. I must admit that I finally gave up on it after a while, not being able to get it to work at all.
I would very much appreciate it if someone could shed some light on the following issues I encountered with the Tesseract preset:
The documentation on manual installation says "Just put all the desired JAR files somewhere in your CLASSPATH". OK, it doesn't work out of the box. My Ubuntu 14 and Netbeans 8 simply refuse to load the native libraries from the supplied JARs. The native libraries have to be EXTRACTED first, otherwise the famous UnsatisfiedLinkError is thrown. Is this the normal behaviour or am I missing something? Is there a dedicated LOAD method I forgot to call?
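For context on why extraction is needed at all: the JDK's System.load only accepts a file on disk, so a library packed inside a JAR has to be copied out before it can be loaded. Below is a minimal sketch of that generic mechanism using only JDK calls. The class NativeExtractor and the ".so" suffix are assumptions made for this example; it is not a description of JavaCPP's own loader.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper showing the generic extract-then-load pattern.
final class NativeExtractor {
    /** Copies a stream to a temporary file and returns the file's path. */
    static Path extractToTemp(InputStream in, String suffix) throws IOException {
        Path tmp = Files.createTempFile("native-", suffix);
        tmp.toFile().deleteOnExit(); // best-effort cleanup when the JVM exits
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    /** Extracts a classpath resource (for example from a JAR) and loads it as a native library. */
    static void loadFromClasspath(String resource) throws IOException {
        try (InputStream in = NativeExtractor.class.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("resource not found on classpath: " + resource);
            }
            Path lib = extractToTemp(in, ".so"); // suffix assumes Linux
            System.load(lib.toAbsolutePath().toString()); // System.load requires an absolute path
        }
    }
}
```

A call would look like NativeExtractor.loadFromClasspath("/path/inside/jar/libexample.so"), where the resource path is purely hypothetical. Libraries that the extracted one depends on still have to be resolvable by the system linker.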
I wasn't able to get Leptonica's pixReadMemTiff function to work. It always returns "false", which indicates some internal error. Reading TIFF files does work though. My project relies heavily on the former method because we're constructing images on the fly. I would have to debug the native library in order to be able to say more. Any suggestions on why it doesn't work?
Leptonica's native library is huge (2.3 MB) but Tesseract utilizes just a few methods, mostly PIX-related. The rest (over 90%) is never used but will be wastefully kept in memory. Is there any possibility to strip out unused code if the intended usage of Leptonica will be Tesseract alone?
Tesseract API supplies an iterator - ResultIterator - to be used for extraction of recognition results. This ResultIterator is inherited from LTRResultIterator which, in turn, is inherited from PageIterator. So it's basically possible to use this ResultIterator for accessing results at different levels like pages, words or single symbols.
In C++, it's possible to access PageIterator's method Empty() using ResultIterator instance:
It doesn't work in JavaCPP v0.10 because of the well-known issue with multiple inheritance in Java. From what I see in the generated code, ResultIterator extends LTRResultIterator, which extends Pointer. Where is PageIterator?
It looks like I need to explicitly request a PageIterator instance like this:
With the earlier versions of JavaCPP the C++ way has just worked, without any casts or extra accessors. Am I missing some important technical point here?
I kinda like the idea of automatic Java interface generation. It promises to save us a lot of time. I would like to use this new feature in my project. That's why I'd highly appreciate any help with getting it up and running.
Thank you in advance!
Best regards
Max P.