Tesseract 4 cannot use anything other than --oem 0 #1043

Closed
nickbe opened this issue Jul 18, 2017 · 50 comments

@nickbe

nickbe commented Jul 18, 2017

Platform is Debian Jessie - Tesseract 4.00 Git Version.
Platform: Linux localhost 4.4.27-x86_64-jb1 #4 SMP Tue Jun 6 14:41:09 CEST 2017 x86_64 GNU/Linux

Tesseract crashes with "Illegal Instruction" when using anything other than --oem 0

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics
Illegal instruction

Tesseract -v reports

tesseract 4.00.00alpha
 leptonica-1.74.4
  libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8 : libopenjp2 2.1.0

 Found AVX
 Found SSE

I can scan with --oem 0 though.

@Shreeshrii
Collaborator

What is the version of your traineddata files? Download the latest version from the tessdata repo.

@nickbe
Author

nickbe commented Jul 18, 2017

Ok, so now I reinstalled tesseract just to make sure I did everything right.
Tessdata files like 'eng.traineddata' have now been downloaded directly from the repo into /usr/local/share/tessdata

Current content:
configs deu.traineddata eng.traineddata pdf.ttf tessconfigs

Now Tesseract starts but tells me that it can't load any language. Which is quite odd.

tesseract --tessdata-dir /usr/local/share/tessdata/tessdata -l eng test.jpg out
results in:

Error opening data file /usr/local/share/tessdata/eng.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.

tesseract -l eng test.jpg out
results in:
Error opening data file /usr/local/share/eng.traineddata

and
tesseract --tessdata-dir /usr/local/share/tessdata -l eng test.jpg out
also results in:
Error opening data file /usr/local/share/eng.traineddata

And whatever I set TESSDATA_PREFIX to (for example TESSDATA_PREFIX=/usr/share/tesseract-ocr/tessdata), it does not get honored at all.
I simply don't get it. What's going on here?
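
A quick way to narrow down an "Error opening data file" message is to check which candidate directories actually contain the language file. The following is only a sketch: the helper name find_traineddata is made up for illustration, and the two directories are the ones mentioned in this thread. It does not reproduce how Tesseract itself resolves TESSDATA_PREFIX or --tessdata-dir.

```shell
#!/bin/sh
# find_traineddata LANG DIR...
# Print every DIR that contains a readable LANG.traineddata file.
# (Hypothetical helper, not part of Tesseract.)
find_traineddata() {
    lang=$1
    shift
    for dir in "$@"; do
        if [ -r "$dir/$lang.traineddata" ]; then
            echo "$dir"
        fi
    done
    return 0
}

# Directories taken from the comments above; adjust them to your system.
find_traineddata eng /usr/local/share/tessdata /usr/share/tesseract-ocr/tessdata
```

If nothing is printed, the file is missing or unreadable in every listed directory, which would be consistent with the error shown above.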

@nickbe
Author

nickbe commented Jul 18, 2017

Ok, I solved the language problem. After unsetting TESSDATA_PREFIX and simply using:
wget https://github.com/tesseract-ocr/tessdata/raw/4.00/deu.traineddata
Tesseract seems to be able to load the language files from the default /usr/local/share/tessdata again.

But still --oem 1 results in:

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics
Illegal instruction

@nickbe
Author

nickbe commented Jul 18, 2017

When using the data files from:
git clone --depth=1 https://github.com/tesseract-ocr/tessdata.git tessdata-repo
tesseract fails to load the language files.

But when using the data files from: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
by downloading with:
wget https://github.com/tesseract-ocr/tessdata/raw/4.00/eng.traineddata
I can start tesseract with --oem 0, but --oem 1 or --oem 2 results in the illegal instruction message

Both ways I put the files into /usr/local/share/tessdata

@Shreeshrii
Collaborator

Shreeshrii commented Jul 19, 2017

Test with the tif file in testing directory. It works ok for me.
My traineddata files are in ../tessdata directory

# tesseract phototest.tif phototest --tessdata-dir ../
Tesseract Open Source OCR Engine v4.00.00dev-2067 with Leptonica
Page 1

# tesseract phototest.tif phototest --tessdata-dir ../ --oem 1
Tesseract Open Source OCR Engine v4.00.00dev-2067 with Leptonica
Page 1

# tesseract phototest.tif phototest --tessdata-dir ../ --oem 2
Tesseract Open Source OCR Engine v4.00.00dev-2067 with Leptonica
Page 1


# tesseract -v
tesseract 4.00.00dev-2067
 leptonica-1.74.4
  libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE

@Shreeshrii
Collaborator

When you say 'Tesseract 4.00 Git Version', I take it to mean that you are using the latest source from GitHub to build tesseract.

@nickbe
Author

nickbe commented Jul 19, 2017

That's correct.

@amitdo
Collaborator

amitdo commented Jul 19, 2017

Please test tesseract with phototest.tif, as Shree suggested.
https://github.com/tesseract-ocr/tesseract/blob/master/testing/phototest.tif

@nickbe
Author

nickbe commented Jul 19, 2017

OK. I tested it with the traineddata above, which is the same data I'm using here.
I also confirmed that tesseract is indeed using the right data folder.

But again, phototest.tif works fine with --oem 0 and results in the same "Illegal instruction" error for any other --oem option or none (default should be --oem 2 if I'm not mistaken).

Compilation seemed fine: I didn't see an error or warning. So I guess there must be some library missing here.

Also I reinstalled Leptonica and Tesseract multiple times now.

Here's how I've installed the tools:

1. Make sure that the following libraries are installed:

       # nickbe:  I had to replace libpng12-dev for debian jessie

	apt-get install autoconf-archive automake g++ libtool libleptonica-dev pkg-config
	apt-get install libpango1.0-dev

	# sudo apt-get install g++ # or clang++
	sudo apt-get install autoconf automake libtool
	sudo apt-get install autoconf-archive
	sudo apt-get install pkg-config
	sudo apt-get install libpng12-dev
	sudo apt-get install libjpeg-turbo
	sudo apt-get install libtiff5-dev
	sudo apt-get install zlib1g-dev

	sudo apt-get install libicu-dev
	sudo apt-get install libpango1.0-dev
	sudo apt-get install libcairo2-dev

2. Install Leptonica:

	git clone --depth 1 https://github.com/DanBloomberg/leptonica.git leptonica
	cd leptonica
	./autobuild
	./configure
	make
	sudo make install
	ldconfig

3. Install Tesseract:

    git clone --depth 1  https://github.com/tesseract-ocr/tesseract.git tesseract-ocr
    cd tesseract-ocr
    ./autogen.sh

    ./configure --disable-openmp --disable-shared --disable-static
    or
    ./configure        # nickbe: I TESTED BOTH CONFIGURATIONS JUST TO MAKE SURE
    make

    sudo make install
	sudo ldconfig

	# sudo make training
	# sudo make training-install

	sudo make install-langs      # nickbe: Never does anything so far
      sudo ldconfig

4. wget tessdata from https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
   to /usr/local/share/tessdata

   Example: wget https://github.com/tesseract-ocr/tessdata/raw/4.00/eng.traineddata

@Shreeshrii
Collaborator

Shreeshrii commented Jul 19, 2017 via email

@amitdo
Collaborator

amitdo commented Jul 19, 2017

Do not install libleptonica-dev with apt-get, since you manually install leptonica later.

@amitdo
Collaborator

amitdo commented Jul 19, 2017

apt-get remove libleptonica-dev

@nickbe
Author

nickbe commented Jul 19, 2017

OK I enabled debug. I also installed the gdb package, but I have no experience with it. How can I provide more information?

@amitdo
Collaborator

amitdo commented Jul 19, 2017

sudo make install-langs # nickbe: Never does anything so far

The comment is correct, so there's no point in doing that.

@nickbe
Author

nickbe commented Jul 19, 2017

I think one quite important piece of information is that I just installed the package on a fresh instance of Debian Stretch. I had no problems installing Leptonica or Tesseract, but after everything was installed I see exactly the same behaviour on this machine. Tesseract runs with --oem 0 but throws the "Illegal instruction" message when trying to use --oem 2 or 1.

Seems that there's something very profound missing in the installation procedure.

@nickbe
Author

nickbe commented Jul 19, 2017

Ok. I managed to run Tesseract with gdb. Here's the output:


(gdb) set args -l eng --oem 2 test.png out
(gdb) run
Starting program: /usr/local/bin/tesseract -l eng --oem 2 test.png out
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics

Program received signal SIGILL, Illegal instruction.
tesseract::DotProductAVX (u=0x127aa70, v=0x51dbc60, n=25) at dotproductavx.cpp:70
70            __m256d floats2 = _mm256_loadu_pd(v);
(gdb)
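
For anyone who needs to capture such a crash without typing into gdb interactively, the session above can be sketched as a one-shot command. The function name crash_backtrace is an assumption for illustration; -batch, -ex and --args are standard gdb options.

```shell
#!/bin/sh
# crash_backtrace PROGRAM [ARGS...]
# Run PROGRAM under gdb in batch mode and print a backtrace when it stops.
# (Hypothetical wrapper; it only calls gdb with standard options.)
crash_backtrace() {
    if ! command -v gdb >/dev/null 2>&1; then
        echo "gdb not found; install the gdb package first" >&2
        return 127
    fi
    # -batch: exit after the commands have run
    # -ex run: start the program; -ex bt: print the backtrace at the stop
    gdb -batch -ex run -ex bt --args "$@"
}

# Usage, with the command line from the session above:
#   crash_backtrace /usr/local/bin/tesseract -l eng --oem 2 test.png out
```

The output can then be pasted into the issue as-is.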

@amitdo
Collaborator

amitdo commented Jul 19, 2017

./configure --disable-openmp --disable-shared --disable-static

#898 (comment)
#943 (comment)

@amitdo
Collaborator

amitdo commented Jul 19, 2017

Change this line in configure.ac
AX_CHECK_COMPILE_FLAG([-mavx], [avx=true], [avx=false])
to
AX_CHECK_COMPILE_FLAG([-mavx], [avx=false], [avx=false])

and recompile tesseract again.
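
The edit above can also be applied with sed instead of by hand. This is a sketch only: disable_avx_check is a hypothetical helper name, and it assumes the line in configure.ac looks exactly as quoted above.

```shell
#!/bin/sh
# disable_avx_check FILE
# Rewrite the -mavx compile-flag check in FILE so that avx is always false.
# (Hypothetical helper; assumes the line is spelled exactly as quoted above.)
disable_avx_check() {
    file=$1
    sed 's/\[-mavx\], \[avx=true\]/[-mavx], [avx=false]/' "$file" > "$file.new" &&
        mv "$file.new" "$file"
}

# Usage, from the tesseract source directory:
#   disable_avx_check configure.ac
#   ./autogen.sh && ./configure && make && sudo make install
```

This is a local workaround for a machine without working AVX, not a change meant for the repository.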

@nickbe
Author

nickbe commented Jul 19, 2017

btw. I already uninstalled libleptonica-dev before.

Do I have to "make uninstall" before recompiling?

@amitdo
Collaborator

amitdo commented Jul 19, 2017

Do I have to "make uninstall" before recompiling?

You mean make uninstall tesseract ?

You don't have to in this case.

Also, what's the output of cat /proc/cpuinfo | grep flags ?

@nickbe
Author

nickbe commented Jul 19, 2017

flags           : fpu tsc msr pae cx8 apic cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid
flags           : fpu tsc msr pae cx8 apic cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid
flags           : fpu tsc msr pae cx8 apic cmov pat clflush mmx fxsr sse sse2 ss syscall nx lm constant_tsc rep_good nopl pni pclmulqdq vmx ssse3 cx16 sse4_1 sse4_2 popcnt aes f16c rdrand hypervisor lahf_lm tpr_shadow vnmi flexpriority ept vpid


@nickbe
Author

nickbe commented Jul 19, 2017

After recompiling everything with the changed flag the new output is:

Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Detected 35 diacritics
DotProductAVX can't be used on Android
DotProductAVX can't be used on Android
Aborted

Android?!

@amitdo
Collaborator

amitdo commented Jul 19, 2017

According to the output of cat /proc/cpuinfo | grep flags, your cpu does not support avx.
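
Reading the flags line by eye is error-prone, so the check can be scripted. This is a minimal sketch that assumes a Linux x86 system where /proc/cpuinfo has a "flags" line; has_cpu_flag is a hypothetical helper name.

```shell
#!/bin/sh
# has_cpu_flag FLAG [FILE]
# Succeed if FLAG appears as a whole word on the first "flags" line of FILE
# (default: /proc/cpuinfo). Hypothetical helper for illustration.
has_cpu_flag() {
    flag=$1
    file=${2:-/proc/cpuinfo}
    grep -m1 '^flags' "$file" 2>/dev/null | grep -qw -- "$flag"
}

for f in sse sse4_1 avx avx2; do
    if has_cpu_flag "$f"; then
        echo "$f: yes"
    else
        echo "$f: no"
    fi
done
```

With the flags posted above, avx would be reported as missing, which contradicts the "Found AVX" line printed by tesseract -v.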

@nickbe
Author

nickbe commented Jul 19, 2017

The latest recompile was already done with the modified configure.ac.
That was the output when running: tesseract -l eng --oem 2 ......
As always --oem 0 works.

@amitdo
Collaborator

amitdo commented Jul 19, 2017

OK. You will have to do another change in the code.

I will tell you later/tomorrow what to do next.

@amitdo
Collaborator

amitdo commented Jul 19, 2017

In arch/simddetect.h

Change this line
static inline bool IsAVXAvailable() { return detector.avx_available_; }
to
static inline bool IsAVXAvailable() { return false; }

I hope we will finish with this change :-)

@stweil
Contributor

stweil commented Jul 19, 2017

It's strange that tesseract -v reports Found AVX while your CPU obviously does not support AVX (see the output of /proc/cpuinfo). That's causing the crash which you observe. What kind of CPU are you using? Are you running on a virtual machine?

Could you use the GDB debugger to step through the function SIMDDetect::SIMDDetect (in arch/simddetect.cpp) which is executed right at the beginning? Maybe you have a buggy __get_cpuid function (or a buggy virtual machine). Try to print the value of ecx which is set by that function.

Removing the code avx_available_ = (ecx & 0x10000000) != 0; will work around the problem and fix the crash. The change suggested by @amitdo will have the same effect.

@amitdo
Collaborator

amitdo commented Jul 19, 2017

and recompile of course.

@amitdo
Collaborator

amitdo commented Jul 19, 2017

Do what I said before listening to @stweil :-)

@amitdo
Collaborator

amitdo commented Jul 19, 2017

@stweil

Yes. It's strange.
I want to make sure the problem is solved after disabling (cheating) AVX detection.
If that happens, nickbe will need to undo the 2 changes and recompile. Then you can do your analysis...

@amitdo
Collaborator

amitdo commented Jul 19, 2017

What kind of CPU are you using?

cat /proc/cpuinfo | grep name

@nickbe
Author

nickbe commented Jul 19, 2017

It's a vServer. Probably XEN but I'm not sure. We do use them quite often without problems. So I have no idea why this case is indeed so strange. I'm recompiling now...

@nickbe
Author

nickbe commented Jul 19, 2017

Yay. It's finally working. Thank you so much, guys 💃 Will the changes in the make make it into the official repository?

So now that I can in fact test the new 4.0 feats, is there a way to speed up scanning? Any switches that are recommended?

@stweil
Contributor

stweil commented Jul 20, 2017

Will the changes in the make make it into the official repository?

No, they won't, because those changes disable AVX support which is highly desired: AVX makes Tesseract faster. The problem is most probably caused by your vServer which returns a wrong cpuid. That cpuid claims that your vServer supports AVX, but it does not. You can try to get more information on that vServer (is it XEN, which version?) and report the problem.

We could add a Tesseract option to select SSE / AVX (overriding the automatic detection). Then Tesseract would still crash by default in your case, but it would be possible to make it work using that new option.

@nickbe
Author

nickbe commented Jul 20, 2017

Is this something new to the 4.00 version? Because the 3.x Versions ran just fine.

@stweil
Contributor

stweil commented Jul 20, 2017

Yes, it's new. AVX is used for the calculation of the dot product which is needed for LSTM (new in 4.00, not used with --oem 0).

@nickbe
Author

nickbe commented Jul 20, 2017

Maybe there's a safer method to detect the capability? Can I find out if other methods show the correct capabilities for you guys?
If you like I'd be happy to grant you access to the server.

@amitdo
Collaborator

amitdo commented Jul 20, 2017

#1043 (comment)

@nickbe
Author

nickbe commented Jul 23, 2017

No, I meant maybe there's a better and more secure way for you guys to recognize this kind of feature.

@stweil
Contributor

stweil commented Jul 23, 2017

@nickbe, you could help by providing more information on the kind of vServer which you were using.

@nickbe
Author

nickbe commented Jul 23, 2017

Sure.
https://www.df.eu/de/cloud-hosting/
Currently it's the second smallest vServer

@stweil
Contributor

stweil commented Jul 24, 2017

I just wrote to Domain Factory (in German, translated here):

One of your customers has reported a problem with the OCR application
Tesseract: #1043 (comment)

The cause of the crash seems to be the CPUID seen from the vServer guest. That CPUID does not fit the real hardware:

According to CPUID, the CPU supports AVX operations. In fact, these lead to a crash.

Doesn't your hardware support AVX (maybe an older XEON CPU)? Probably the VM of the customer migrated from newer hardware (with AVX) to an older hardware (without AVX), and now it still uses the CPUID of the newer hardware.

What do you advise users in this case?
You can also reply directly to GitHub (URL above).

XEN can set the CPUID seen by guests to avoid exactly that kind of problem: it can mask the AVX bit even when running on a new CPU with AVX support, thus allowing migration to an older CPU.

@stweil
Contributor

stweil commented Jul 24, 2017

@nickbe, could you please also run cpuid --one-cpu and cpuid --one-cpu --raw and post the output?

@stweil
Contributor

stweil commented Jul 24, 2017

Nick, Domain Factory support asks for the name of your Jiffy Box. Could you send me your e-mail address (get my address here)? Then I'll forward their request to you.

@amitdo
Collaborator

amitdo commented Sep 10, 2017

@nickbe, did you manage to solve the issue?

@stweil
Contributor

stweil commented Sep 10, 2017

@amitdo, I had contacted Nick's provider. They use XEN servers which do not support AVX, but the CPUID which is seen from the vServer claims that AVX is available. As far as I have understood, this happens when a XEN vServer initially runs on a server with AVX, but is migrated to another server without AVX later.

Only the provider can handle that correctly. Either the XEN vServer must always run on servers with AVX, or the XEN configuration must disable the AVX settings in CPUID even if the server has AVX support.

On the Tesseract side we could try to get a more robust AVX detection which not only checks CPUID. In addition we need an option or parameter to override the automatic selection of SSE2 / AVX.

@amitdo
Collaborator

amitdo commented Sep 10, 2017

Ok, @stweil. Thanks for the info.

@nickbe
Author

nickbe commented Sep 11, 2017

Hi guys, yes, I successfully solved the problem by following your instructions to patch the settings.
Thanks again for your support here. Much appreciated indeed :)

@zdenop zdenop closed this as completed Sep 12, 2017
@ken4ward

What does oem mean, and how do I set it in my java project?

@amitdo amitdo added the SIMD label Aug 22, 2022