PDF output without image #83
Comments
@olcc Tesseract is a raw OCR engine. Have a look at my project, OCRmyPDF, which provides a nice wrapper around Tesseract and takes care of many details to improve visualization.
@olcc, the way PDF output is produced changed significantly in Tesseract 3.04, so I plan to change this in future commits. I'll take your idea into consideration. But as I remember, the new implementation does not produce the text anymore; it outputs directly to the file. Even so, you are able to read the file manually and modify it as you wish.
@olcc: Tesseract puts into the PDF the image that you provided as input (i.e. the image you see in the PDF is not optimized for OCR, as you claim). If your experience differs, please provide an example. Otherwise, close the "issue".
@jbarlow83: Thanks for pointing to the "OCRmyPDF" wrapper.
What I want to say is that if you run:
@olcc we here fully rely on these "mixed-mode" PDFs as generated by
which works with very high quality, depending on the quality of what you input to Tesseract. I hope that the present "pdf" option ( -c tessedit_create_pdf=1 ) will never be dropped from the code.
@zdenop, is this functionality documented anywhere? Could you point me to the exact place in the code where it's implemented?
@amitdo: it is implemented in pdfrenderer. This is not a real issue (no bug in Tesseract), so I am closing it. Please use the Tesseract user forum for questions and support.
Whenever possible. The design intent is to copy the image bytes without using a https://en.wikipedia.org/wiki/Tagged_Image_File_Format#TIFF_Compression_Tag
I'd like to support the original wish. Having something like tesseract OCR.tif ORIGINAL pdf-overlay to produce only the text overlay in a PDF file would provide a lot of flexibility. With this, you could write frontends to Tesseract capable of placing the invisible text overlay on something different from OCR.tif (e.g. a full-color version of OCR.tif).
Hello,
I noticed the new "pdf" option in Tesseract, which creates a PDF file with the image and the background text. That's great!
But usually, the image given to Tesseract is not as nice as the starting image (because it is optimized for OCR, not for human visualization). Maybe it would be useful to provide the step before, i.e. the PDF of the generated text without the image, so that the user can paste it as a background text with pdftk for example.
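To make the proposed workflow concrete, here is a rough Python sketch that only builds the two command lines involved. The first is the existing `pdf` output mode; the second uses pdftk's `multibackground` operation to put a nicer-looking PDF behind a text-only PDF. The file names are placeholders, and `text_only.pdf` is hypothetical: it stands for the text-only output requested here, which this sketch does not produce.

```python
import shlex


def tesseract_pdf_command(image_path, output_base):
    """Build the command for Tesseract's existing "pdf" output mode.

    Tesseract appends ".pdf" to output_base on its own.
    """
    return ["tesseract", image_path, output_base, "pdf"]


def pdftk_overlay_command(text_pdf, image_pdf, output_pdf):
    """Build a pdftk command that places each page of image_pdf behind
    the matching page of text_pdf.

    text_pdf is assumed to be a PDF holding only the invisible text layer
    (the output this issue asks for, hypothetical here).
    """
    return ["pdftk", text_pdf, "multibackground", image_pdf, "output", output_pdf]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    commands = (
        tesseract_pdf_command("OCR.tif", "ocr_result"),
        pdftk_overlay_command("text_only.pdf", "original_scan.pdf", "merged.pdf"),
    )
    for cmd in commands:
        # Print the commands instead of running them, so the sketch does not
        # require tesseract or pdftk to be installed.
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools and input files are available.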