Floating point exception with tessdata models since version 5.4.0 #4257

Closed
yeezy69 opened this issue Jun 8, 2024 · 6 comments

Comments

@yeezy69

yeezy69 commented Jun 8, 2024

Current Behavior

I use OCRmyPDF on Arch Linux. It has been crashing since yesterday, when tesseract was updated from version 5.3.4-2 to 5.4.0-1. After a downgrade, tesseract works as expected with the same image.

I executed the ocrmypdf commands manually:

$ gs -dQUIET -dSAFER -dBATCH -dNOPAUSE -dInterpolateControl=-1 -sDEVICE=png16m -dFirstPage=1 -dLastPage=1 -r200.161797x200.161797 -dPDFSTOPONERROR -o image.png -sstdout=%stderr -dAutoRotatePages=/None -f doc20240608121758.pdf

$ tesseract -l deu image.png 000001_ocr_hocr hocr txt
[1] 9771 floating point exception (core dumped)

$ pacman -U tesseract-5.3.4-2-x86_64.pkg.tar.zst

$ tesseract -l deu image.png 000001_ocr_hocr hocr txt
(works fine)

For data protection reasons, I recreated a document that caused the program to crash:

[image: recreated sample document]

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

tesseract 5.4.0
leptonica-1.84.1
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 3.0.2) : libpng 1.6.43 : libtiff 4.6.0 : zlib 1.3.1 : libwebp 1.4.0 : libopenjp2 2.5.2
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Found OpenMP 201511
Found libarchive 3.7.4 zlib/1.3.1 liblzma/5.6.1 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.8.0 OpenSSL/3.3.1 zlib/1.3.1 brotli/1.1.0 zstd/1.5.6 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.0

Operating System

No response

Other Operating System

Archlinux

uname -a

Linux pc 6.9.3-arch1-1 #1 SMP PREEMPT_DYNAMIC Fri, 31 May 2024 15:14:45 +0000 x86_64 GNU/Linux

Compiler

No response

CPU

Intel i7-8650U

Virtualization / Containers

No response

Other Information

$ gdb --args tesseract -l deu image.png 000001_ocr_hocr hocr txt

(gdb) run
Starting program: /usr/bin/tesseract -l deu image.png 000001_ocr_hocr hocr txt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
[New Thread 0x7ffff20006c0 (LWP 10820)]
[New Thread 0x7ffff16006c0 (LWP 10821)]
[New Thread 0x7ffff0c006c0 (LWP 10822)]

Thread 1 "tesseract" received signal SIGFPE, Arithmetic exception.
0x00007ffff7d8afa4 in tesseract::Classify::ComputeNormMatch(int, tesseract::FEATURE_STRUCT const&, bool) () from /usr/lib/libtesseract.so.5

(gdb) bt
#0 0x00007ffff7d8afa4 in tesseract::Classify::ComputeNormMatch(int, tesseract::FEATURE_STRUCT const&, bool) () from /usr/lib/libtesseract.so.5
#1 0x00007ffff7d806c5 in tesseract::Classify::ComputeIntCharNormArray(tesseract::FEATURE_STRUCT const&, unsigned char*) () from /usr/lib/libtesseract.so.5
#2 0x00007ffff7d705ad in tesseract::Classify::ComputeCharNormArrays(tesseract::FEATURE_STRUCT*, tesseract::INT_TEMPLATES_STRUCT*, unsigned char*, unsigned char*) () from /usr/lib/libtesseract.so.5
#3 0x00007ffff7d709fe in tesseract::Classify::CharNormTrainingSample(bool, int, tesseract::TrainingSample const&, std::vector<tesseract::UnicharRating, std::allocator<tesseract::UnicharRating> >*) () from /usr/lib/libtesseract.so.5
#4 0x00007ffff7d9312b in tesseract::TessClassifier::UnicharClassifySample(tesseract::TrainingSample const&, tesseract::Image, int, int, std::vector<tesseract::UnicharRating, std::allocator<tesseract::UnicharRating> >*) () from /usr/lib/libtesseract.so.5
#5 0x00007ffff7d6e575 in tesseract::Classify::CharNormClassifier(tesseract::TBLOB*, tesseract::TrainingSample const&, tesseract::ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.5
#6 0x00007ffff7d73e76 in tesseract::Classify::DoAdaptiveMatch(tesseract::TBLOB*, tesseract::ADAPT_RESULTS*) () from /usr/lib/libtesseract.so.5
#7 0x00007ffff7d6c5a3 in tesseract::Classify::AdaptiveClassifier(tesseract::TBLOB*, tesseract::BLOB_CHOICE_LIST*) () from /usr/lib/libtesseract.so.5
#8 0x00007ffff7e4d1a4 in tesseract::Wordrec::call_matcher(tesseract::TBLOB*) () from /usr/lib/libtesseract.so.5
#9 0x00007ffff7e5aaeb in tesseract::Wordrec::classify_blob(tesseract::TBLOB*, char const*, tesseract::ScrollView::Color, tesseract::BlamerBundle*) () from /usr/lib/libtesseract.so.5
#10 0x00007ffff7e5ac41 in tesseract::Wordrec::classify_piece(std::vector<tesseract::SEAM*, std::allocator<tesseract::SEAM*> > const&, short, short, char const*, tesseract::TWERD*, tesseract::BlamerBundle*) () from /usr/lib/libtesseract.so.5
#11 0x00007ffff7e4b1d4 in tesseract::Wordrec::chop_word_main(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#12 0x00007ffff7e4b6f2 in tesseract::Wordrec::cc_recog(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#13 0x00007ffff7d273f9 in tesseract::Tesseract::recog_word_recursive(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#14 0x00007ffff7d2854b in tesseract::Tesseract::recog_word(tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#15 0x00007ffff7d28927 in tesseract::Tesseract::tess_segment_pass_n(int, tesseract::WERD_RES*) () from /usr/lib/libtesseract.so.5
#16 0x00007ffff7cda5a2 in tesseract::Tesseract::match_word_pass_n(int, tesseract::WERD_RES*, tesseract::ROW*, tesseract::BLOCK*) () from /usr/lib/libtesseract.so.5
#17 0x00007ffff7ce1102 in tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*) () from /usr/lib/libtesseract.so.5
#18 0x00007ffff7cd0c51 in tesseract::Tesseract::RetryWithLanguage(tesseract::WordData const&, void (tesseract::Tesseract::*)(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*), bool, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*) () from /usr/lib/libtesseract.so.5
#19 0x00007ffff7cd1ac5 in tesseract::Tesseract::classify_word_and_language(int, tesseract::PAGE_RES_IT*, tesseract::WordData*) () from /usr/lib/libtesseract.so.5
#20 0x00007ffff7cd573d in tesseract::Tesseract::RecogAllWordsPassN(int, tesseract::ETEXT_DESC*, tesseract::PAGE_RES_IT*, std::vector<tesseract::WordData, std::allocator<tesseract::WordData> >*) () from /usr/lib/libtesseract.so.5
#21 0x00007ffff7cd5ee5 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) () from /usr/lib/libtesseract.so.5
#22 0x00007ffff7c9a23d in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) () from /usr/lib/libtesseract.so.5
#23 0x00007ffff7c9d963 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#24 0x00007ffff7c9ef88 in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#25 0x00007ffff7c9f1b4 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) () from /usr/lib/libtesseract.so.5
#26 0x0000555555558797 in ?? ()
#27 0x00007ffff714ec88 in ?? () from /usr/lib/libc.so.6
#28 0x00007ffff714ed4c in __libc_start_main () from /usr/lib/libc.so.6
#29 0x00005555555598b5 in ?? ()

@stweil
Contributor

stweil commented Jun 8, 2024

I get no crash with a debug build on Debian GNU/Linux.

@stweil changed the title from "SIGFPE Floating point exception since version 5.4.0" to "Floating point exception on Arch Linux since version 5.4.0" on Jun 8, 2024
@stweil
Contributor

stweil commented Jun 8, 2024

This patch fixes it:

diff --git a/src/classify/normmatch.cpp b/src/classify/normmatch.cpp
index 6ea75b99..f89897f5 100644
--- a/src/classify/normmatch.cpp
+++ b/src/classify/normmatch.cpp
@@ -145,7 +145,7 @@ float Classify::ComputeNormMatch(CLASS_ID ClassId, const FEATURE_STRUCT &feature
       BestMatch = Match;
     }
   }
-  return 1 - NormEvidenceOf(BestMatch);
+  return (Protos == nullptr) ? 1 : 1 - NormEvidenceOf(BestMatch);
 } /* ComputeNormMatch */
 
 void Classify::FreeNormProtos() {

I still have to examine why the exception does not occur on Debian GNU/Linux.
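To illustrate what the guard in the patch does, here is a minimal stand-alone sketch. It is not the Tesseract implementation: `norm_evidence_of` and `compute_norm_match` are hypothetical simplified stand-ins, the prototype list is modelled as a plain vector of distances, and the midpoint of 32 with a squared adjustment is an assumed parameter choice. The point is only the control flow: when there are no prototypes, the best match would stay at its `FLT_MAX` starting value, so the function returns the worst score 1 directly instead of passing that value on.

```cpp
#include <cfloat>
#include <vector>

// Simplified stand-in for NormEvidenceOf (assumption: midpoint 32, squared
// adjustment). Maps a distance to an evidence value in (0, 1].
float norm_evidence_of(float norm_adj) {
  norm_adj /= 32.0f;
  norm_adj *= norm_adj;
  return 1.0f / (1.0f + norm_adj);
}

// Hypothetical model of the patched control flow. A null or empty prototype
// list yields the worst score 1 without evaluating norm_evidence_of(FLT_MAX).
float compute_norm_match(const std::vector<float> *proto_distances) {
  if (proto_distances == nullptr || proto_distances->empty()) {
    return 1.0f;
  }
  float best_match = FLT_MAX;
  for (float distance : *proto_distances) {
    if (distance < best_match) {
      best_match = distance;
    }
  }
  return 1.0f - norm_evidence_of(best_match);
}
```

In this sketch, `compute_norm_match(nullptr)` returns 1 without any floating-point arithmetic, which mirrors the `Protos == nullptr` branch of the patch.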

@amitdo

This comment was marked as outdated.

@stweil
Contributor

stweil commented Jun 9, 2024

Arch Linux installs model files from tessdata, while Debian installs model files from tessdata_fast.

With a tessdata model I could reproduce the FP overflow in NormEvidenceOf, which is called with FLT_MAX and tries to calculate the square of this value.
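The overflow itself can be observed outside Tesseract. The following is a minimal sketch, not Tesseract code: `square_overflows` is a hypothetical helper that squares a value and then checks the `FE_OVERFLOW` flag from `<cfenv>`. By default such an overflow only sets a status flag and produces infinity. A SIGFPE like the one in the backtrace additionally requires floating-point traps to be enabled in the process, which is assumed here for the crashing binary rather than shown.

```cpp
#include <cfenv>
#include <cfloat>

// Hypothetical helper (not part of Tesseract): squares x and reports whether
// the multiplication raised the IEEE 754 overflow flag.
bool square_overflows(float x) {
  std::feclearexcept(FE_ALL_EXCEPT);
  volatile float value = x;               // volatile prevents constant folding
  volatile float squared = value * value; // FLT_MAX * FLT_MAX exceeds float range
  (void)squared;
  return std::fetestexcept(FE_OVERFLOW) != 0;
}
```

Calling `square_overflows(FLT_MAX)` reports an overflow, while a small input such as `2.0f` does not.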

I wonder why none of our continuous integration tests detected this regression. Obviously the tests must be improved.

@stweil changed the title from "Floating point exception on Arch Linux since version 5.4.0" to "Floating point exception with tessdata models since version 5.4.0" on Jun 9, 2024
stweil added a commit to stweil/tesseract that referenced this issue Jun 9, 2024
Signed-off-by: Stefan Weil <sw@weilnetz.de>
stweil added a commit that referenced this issue Jun 9, 2024
Signed-off-by: Stefan Weil <sw@weilnetz.de>
@stweil
Contributor

stweil commented Jun 11, 2024

The new release 5.4.1 includes the fix, so this issue can be closed as soon as the fix has been confirmed with an updated package for Arch Linux.

@MarcRdC

MarcRdC commented Jun 11, 2024

I compiled the package; it seems to be working fine: no more floating‐point exceptions.

@amitdo amitdo closed this as completed Jun 11, 2024