Regression: gif not scraping in 5.2 that was OK in 5.1 #3940

Closed
jacekkopecky opened this issue Oct 7, 2022 · 8 comments

@jacekkopecky

Environment

  • Tesseract Version: v5.2.0.20220712
  • Platform: Windows 10, x64

Current Behavior:

tesseract.exe v5.2 only scrapes the top line (the window title) from this GIF image:

image-screenshot-pdf.gif

Expected Behavior:

It should scrape all text.

Tesseract v5.1.0.20220510 worked as expected, and both 5.1 and 5.2 work as expected with this equivalent png image:

image-screenshot-pdf.png

Suggested Fix:

It might have something to do with a different DPI estimation: 5.2 estimates resolution 132, while 5.1 estimated 168. However, running 5.2 with --dpi 168 does not seem to fix anything.
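
To compare the default run with the `--dpi 168` run side by side, the two invocations can be scripted. This is a minimal sketch, assuming `tesseract` is on the PATH and the GIF from this report sits in the working directory; `build_command` and `run_ocr` are hypothetical helper names, not part of Tesseract. It relies only on the `--dpi` option mentioned above and the `stdout` output base.

```python
import os
import shutil
import subprocess


def build_command(image, dpi=None, exe="tesseract"):
    """Return the argument list for one OCR run that prints text to stdout."""
    cmd = [exe, image, "stdout"]
    if dpi is not None:
        # --dpi overrides the resolution tesseract would otherwise estimate.
        cmd += ["--dpi", str(dpi)]
    return cmd


def run_ocr(image, dpi=None):
    """Run tesseract and return the recognized text."""
    result = subprocess.run(
        build_command(image, dpi), capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    image = "image-screenshot-pdf.gif"  # file name taken from this report
    can_run = shutil.which("tesseract") is not None and os.path.isfile(image)
    for dpi in (None, 168):
        if can_run:
            text = run_ocr(image, dpi)
            print(f"dpi={dpi}: {len(text.splitlines())} lines recognized")
        else:
            # Without tesseract or the image, just show what would be executed.
            print("would run:", " ".join(build_command(image, dpi)))
```

If both runs report the same number of lines, the DPI estimate is unlikely to be the cause.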

@amitdo
Collaborator

amitdo commented Oct 7, 2022

What's the output of tesseract -v for each version?

I want to see if the leptonica and gif libraries being used are exactly the same.

@jacekkopecky
Author

tesseract v5.1.0.20220510
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0
tesseract v5.2.0.20220712
 leptonica-1.78.0
  libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
 Found AVX512BW
 Found AVX512F
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.5.0 zlib/1.2.11 liblzma/5.2.3 bz2lib/1.0.6 liblz4/1.7.5 libzstd/1.4.5
 Found libcurl/7.77.0-DEV Schannel zlib/1.2.11 zstd/1.4.5 libidn2/2.0.4 nghttp2/1.31.0

@zdenop
Contributor

zdenop commented Oct 7, 2022

It seems like there is a problem with thresholding/binarization:

tesseract i3940.gif i3940.gif get.images produced:

i3940 gif processed

tesseract i3940.png i3940.png get.images produced:

i3940 png processed

By the way: the GIF has 256 colors (8 bits per pixel), and the PNG version has 16.7 million colors (32 bits per pixel).
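
The color counts follow from the bit depths. A small sketch of the arithmetic, assuming that 24 of the PNG's 32 bits carry RGB and the remaining 8 are alpha or padding (the thread does not state the channel layout):

```python
def color_count(bits_of_color):
    """Number of distinct values representable with the given number of bits."""
    return 2 ** bits_of_color


gif_colors = color_count(8)   # 8-bit palette indices
png_colors = color_count(24)  # assumed: 24 of the 32 bits carry RGB

print(gif_colors)  # 256
print(png_colors)  # 16777216, i.e. about 16.7 million
```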

@amitdo
Collaborator

amitdo commented Oct 7, 2022

18fb5aa

@zdenop
Contributor

zdenop commented Oct 8, 2022

@amitdo: that commit is not the problem. The input GIF image already has an 8-bit depth, so it just makes a copy of it.

@amitdo
Collaborator

amitdo commented Oct 8, 2022

5.1.0...5.2.0

@zdenop
Contributor

zdenop commented Oct 8, 2022

OK, I got it: the colormap is causing the problem.

@zdenop
Contributor

zdenop commented Oct 14, 2022

Explanation of the problem/dilemma:

  1. Removing the colormap from an 8-bit color image returns a 32-bit image, so the output PDF file size will increase; see issue "PNG images being reencoded unnecessarily in PDFs" #3092.
  2. Not removing the colormap causes problems on images like the one above (interestingly, the image from issue #3092 is also 8-bit with a colormap).
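
Both sides of this dilemma can be illustrated with a toy example. The following is a pure-Python sketch, not Tesseract's or Leptonica's code: the palette, the pixel row, the fixed threshold of 128, and all helper names are made up for illustration. It shows that thresholding raw palette indices can misclassify pixels, because an index says nothing about brightness, while expanding the colormap first gives sensible results at the cost of four bytes per pixel instead of one.

```python
# Hypothetical 4-entry palette: index -> (R, G, B). Index order is arbitrary,
# so a low index can be a light color and a high index a dark one.
PALETTE = [
    (255, 255, 255),  # 0: white background
    (0, 0, 0),        # 1: black text
    (240, 240, 240),  # 2: light gray
    (20, 20, 60),     # 3: dark blue text
]

# One row of an indexed (colormapped) image: each value is a palette index.
INDEXED_ROW = [0, 1, 2, 3, 0, 3]


def remove_colormap(row, palette):
    """Expand palette indices to 32-bit RGBA tuples."""
    return [palette[i] + (255,) for i in row]


def luminance(rgba):
    """Integer approximation of brightness (ITU-R BT.601 weights)."""
    r, g, b, _ = rgba
    return (299 * r + 587 * g + 114 * b) // 1000


def threshold(values, limit=128):
    """Binarize: 1 = foreground (dark), 0 = background (light)."""
    return [1 if v < limit else 0 for v in values]


# Wrong: treat palette indices as if they were gray levels.
naive = threshold(INDEXED_ROW)

# Better: remove the colormap first, then threshold real brightness values.
rgba_row = remove_colormap(INDEXED_ROW, PALETTE)
correct = threshold([luminance(p) for p in rgba_row])

print(naive)    # [1, 1, 1, 1, 1, 1]: every pixel looks "dark"
print(correct)  # [0, 1, 0, 1, 0, 1]: only the text pixels are foreground

# The cost: 1 byte per pixel with a colormap, 4 bytes per pixel without.
print(len(INDEXED_ROW), "bytes indexed vs", len(rgba_row) * 4, "bytes RGBA")
```

The last line makes the trade-off from point 1 concrete: the expanded row is four times larger than the indexed one.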

zdenop added a commit that referenced this issue Oct 19, 2022
fix issue #3940 - remove colormap before thresholding
stweil added a commit that referenced this issue Oct 24, 2022
Fixes: 95019a8 ("fix issue #3940 - remove colormap before thresholding")
Signed-off-by: Stefan Weil <sw@weilnetz.de>
@amitdo amitdo closed this as completed Oct 26, 2022