PDF output without image #83

olcc · 2015-08-15T15:12:18Z

Hello,

I noticed the new "pdf" option in Tesseract, which creates a PDF file with the image and the background text. That's great!

But usually, the image given to Tesseract is not as nice as the starting image (because it is optimized for OCR, not for human visualization). Maybe it would be useful to provide the step before, i.e. the PDF of the generated text without the image, so that the user can paste it as a background text with pdftk for example.

jbarlow83 · 2015-08-16T08:03:14Z

@olcc Tesseract is a raw OCR engine. Have a look at my project, OCRmyPDF, which provides a nice wrapper around Tesseract and takes care of many details to improve visualization.

Wikinaut · 2015-08-16T08:15:45Z

@olcc link is OCRmyPDF

ws233 · 2015-08-16T10:04:26Z

@olcc, the way to produce PDF has significantly changed in Tesseract 8.04. So I have a plan to change this in future commits. I'll take your idea into consideration. But as I remember, the new implementation does not produce the text anymore; it outputs directly to the file. Even so, you can still read the file manually and modify it as you wish.

zdenop · 2015-08-16T10:30:34Z

@olcc: tesseract puts into the PDF the image that you provided as input (i.e. the file you see in the PDF is not optimized for OCR, as you claim). If your experience is different, please provide an example. Otherwise close the "issue".

olcc · 2015-08-16T19:56:01Z

@jbarlow83: Thanks for pointing to the "OCRmyPDF" wrapper.
@ws233: Tesseract 8.04? I'm quite late, I only have 3.04! ;-) (from Debian)
@zdenop: Sorry, I didn't understand your message. Maybe my English is not good enough. My process is the following:

ORIGINAL.jpg -> OCR.tif (remove colors, apply threshold, etc.)
tesseract OCR.tif result -l eng pdf
If you say that showing OCR.tif in the PDF is the right thing to do, I disagree in general. I agree this is a very nice feature. However, most people want to have ORIGINAL.jpg with the ocr text.
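
A minimal sketch of the second step of this process, assuming Python 3.8+ and a `tesseract` binary on the PATH. The helper name `tesseract_pdf_command` is made up for illustration, and the preprocessing that turns ORIGINAL.jpg into OCR.tif is left out because the thread does not say which tool is used for it.

```python
import os
import shlex
import shutil
import subprocess


def tesseract_pdf_command(image, output_base, lang="eng"):
    """Return the argv for: tesseract IMAGE OUTPUT_BASE -l LANG pdf."""
    return ["tesseract", image, output_base, "-l", lang, "pdf"]


cmd = tesseract_pdf_command("OCR.tif", "result")
print(shlex.join(cmd))  # tesseract OCR.tif result -l eng pdf

# Only run the command when tesseract is installed and the input exists.
if shutil.which("tesseract") and os.path.exists("OCR.tif"):
    subprocess.run(cmd, check=True)  # writes result.pdf next to the script
```

The page image that ends up in result.pdf is the file passed as the first argument (OCR.tif here), which is exactly the behaviour being discussed.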

zdenop · 2015-08-16T20:02:49Z

What I want to say is that if you run:
tesseract OCR.tif ORIGINAL pdf
then ORIGINAL.tif is included in ORIGINAL.pdf WITHOUT any modification. If you want to include ORIGINAL.jpg instead of OCR.tif, then it is not a tesseract issue ;-)

Wikinaut · 2015-08-16T20:24:29Z

@olcc we here fully rely on these "mixed-mode" PDFs as generated by

tesseract OCR.tif ORIGINAL pdf

which works with very high quality, depending on the quality of what you input to tesseract. I hope that the present "pdf" option ( -c tessedit_create_pdf=1 ) will really never be dropped from the code.
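
As a sketch of the two spellings mentioned here: the trailing `pdf` argument names a config file, while `-c tessedit_create_pdf=1` sets the corresponding variable directly. The helper name `pdf_command` is hypothetical, and the assumption that both forms produce the same output is taken from the comment above rather than checked against the Tesseract source.

```python
def pdf_command(image, output_base, use_config_variable=False):
    """Build a tesseract argv that requests PDF output in one of two ways."""
    cmd = ["tesseract", image, output_base]
    if use_config_variable:
        # Set the variable explicitly on the command line.
        cmd += ["-c", "tessedit_create_pdf=1"]
    else:
        # Name the "pdf" config file as the last argument.
        cmd.append("pdf")
    return cmd


print(" ".join(pdf_command("OCR.tif", "ORIGINAL")))
print(" ".join(pdf_command("OCR.tif", "ORIGINAL", use_config_variable=True)))
```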

amitdo · 2015-08-16T23:35:13Z

@zdenop, is this functionality documented anywhere?

Could you point me to the exact place in the code where it's implemented?

zdenop · 2015-08-17T07:04:49Z

@amitdo: it is implemented in pdfrenderer

This is not a real issue (no bug in tesseract), so I am closing it. Please use the tesseract user forum for questions and support.

jbreiden · 2016-02-05T18:57:43Z

"ORIGINAL.tif is included in ORIGINAL.pdf WITHOUT any modification"

Whenever possible. The design intent is to copy the image bytes without a
decompress/compress cycle whenever we can. Sometimes that is impossible (TIFF
is an enormously flexible graphics format) and sometimes we haven't quite
gotten there. For example, TIFF CCITT Group 4 still goes through a lossless
decompress/compress, simply because we haven't done the work to optimize
this code path in Tesseract / Leptonica. All relevant Tesseract code is in
api/pdfrenderer.cc, but we try to push the image heavy lifting into Leptonica.

https://en.wikipedia.org/wiki/Tagged_Image_File_Format#TIFF_Compression_Tag
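
Since the outcome depends on how the TIFF is compressed, it can help to check the Compression tag of the input file. Below is a minimal sketch using only the Python standard library. It reads tag 259 from the first image directory of a classic (non-BigTIFF) file; the function name `tiff_compression` and the small name table are illustrative. It says nothing about which code path Tesseract or Leptonica will actually take for a given value.

```python
import struct

# A few values of the TIFF Compression tag (tag 259) from the TIFF 6.0 spec.
COMPRESSION_NAMES = {
    1: "uncompressed",
    2: "CCITT modified Huffman RLE",
    3: "CCITT Group 3",
    4: "CCITT Group 4",
    5: "LZW",
    7: "JPEG",
    32773: "PackBits",
}


def tiff_compression(data):
    """Return the Compression tag value of the first IFD in TIFF bytes."""
    if data[:2] == b"II":
        endian = "<"  # little-endian ("Intel") byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian ("Motorola") byte order
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        tag, _field_type, _count = struct.unpack(endian + "HHI", data[start:start + 8])
        if tag == 259:
            # Compression is a SHORT stored in the first 2 bytes of the value field.
            (value,) = struct.unpack(endian + "H", data[start + 8:start + 10])
            return value
    return 1  # the spec's default when the tag is absent


# Tiny in-memory fragment: header plus an IFD holding one Compression entry.
sample = (
    b"II"
    + struct.pack("<HI", 42, 8)
    + struct.pack("<H", 1)
    + struct.pack("<HHIHH", 259, 3, 1, 4, 0)
    + struct.pack("<I", 0)
)
print(COMPRESSION_NAMES[tiff_compression(sample)])  # CCITT Group 4
```

For a real file, pass the bytes read from disk, for example the contents of OCR.tif opened in binary mode.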

sergiocallegari · 2016-05-13T13:56:12Z

I'd like to support the original wish. Having something like

tesseract OCR.tif ORIGINAL pdf-overlay

to produce only the text overlay in a PDF file would provide a lot of flexibility. With this, you could write frontends to tesseract capable of placing the invisible text overlay on something different from OCR.tif (e.g. a full color version of OCR.tif, etc.)
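
A rough sketch of what such a frontend could look like. Note that `pdf-overlay` is only a proposal in this thread. The sketch instead assumes a later Tesseract release that provides the `textonly_pdf` config variable (not available in 3.04) and uses pdftk's `background` operation to put an image PDF behind the text layer. The helper names and file names are hypothetical, and producing an image PDF with a matching page size is left to the caller.

```python
def text_only_pdf_command(ocr_image, output_base, lang="eng"):
    """argv for a PDF holding only the invisible text layer.

    Assumes a Tesseract build that knows the textonly_pdf variable.
    """
    return ["tesseract", ocr_image, output_base,
            "-l", lang, "-c", "textonly_pdf=1", "pdf"]


def overlay_command(text_pdf, image_pdf, output_pdf):
    """argv that places image_pdf behind the pages of text_pdf with pdftk."""
    return ["pdftk", text_pdf, "background", image_pdf, "output", output_pdf]


# OCR the cleaned-up image, then combine the text layer with a PDF made
# from the image meant for human readers (here called ORIGINAL_image.pdf).
steps = [
    text_only_pdf_command("OCR.tif", "text_layer"),
    overlay_command("text_layer.pdf", "ORIGINAL_image.pdf", "ORIGINAL.pdf"),
]
for step in steps:
    print(" ".join(step))
```

Each list can be passed to `subprocess.run(step, check=True)` once both tools are installed.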

zdenop closed this as completed Aug 17, 2015

amitdo mentioned this issue Feb 5, 2016

Feature request: Add optional input for alternate image to use when sandwiching OCR data #210

Closed

amitdo added the PDF label May 30, 2016

kolomiyets mentioned this issue Feb 23, 2017

Misaligned (wrong?) debug log #735

Closed

ashinpan mentioned this issue Jul 12, 2023

build failure: The source directory "tiff_test.cpp" does not exist. #4101

Closed
