OCRmyPDF FAQ and Usage
OCRmyPDF is excellent at converting PDFs to searchable OCR PDFs in an unsupervised fashion. It uses all available CPU cores and has a pipelined architecture that helps to schedule CPUs efficiently.
It works well as a batch job, whether driven by a simple for-loop or managed by a directory-watching program. It has been used for large-scale batch jobs.
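The for-loop style of batch job could look roughly like the following. This is a minimal sketch, not part of OCRmyPDF: the helper names build_command, plan_batch and run_batch are hypothetical, and it assumes the ocrmypdf command is on the PATH and is invoked as ocrmypdf [options] input output.

```python
import subprocess
from pathlib import Path


def build_command(src, dst, extra_args=()):
    """Return the argument list for one run: ocrmypdf [options] input output."""
    return ["ocrmypdf", *extra_args, str(src), str(dst)]


def plan_batch(in_dir, out_dir):
    """Pair every PDF in in_dir with an output path of the same name in out_dir."""
    return [(src, out_dir / src.name) for src in sorted(in_dir.glob("*.pdf"))]


def run_batch(in_dir, out_dir, extra_args=()):
    """Run the jobs one after another and return the inputs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src, dst in plan_batch(in_dir, out_dir):
        # A non-zero exit status is treated as a failure (an assumption of this sketch).
        result = subprocess.run(build_command(src, dst, extra_args))
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling run_batch(Path("scans"), Path("ocr")) would process every PDF in a hypothetical scans folder. The jobs run one at a time here because each OCRmyPDF run already uses all available CPU cores.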
"First do no harm" is a principle it tries to follow. Generally, it tries to change only things in a PDF that must change to complete OCR. If you accidentally run it on a regular "born digital" PDF (for example, a Microsoft Word document converted to PDF), or a PDF that contains a mix of "born digital" and scanned content, it can add OCR without destroying the scanned content (if the --skip-text option is provided). It can also force rasterizing of all this content with --force-ocr.
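One way to keep the choice between these two options explicit in a script is to map a mode name to the option. This is only a sketch: the mode names and the functions ocr_mode_args and build_command are hypothetical, and it assumes that at most one of the two options is passed per run.

```python
def ocr_mode_args(mode):
    """Map a hypothetical mode name to the OCRmyPDF option named in the text."""
    modes = {
        "default": [],                  # no extra option
        "skip-text": ["--skip-text"],   # option for born-digital or mixed PDFs
        "force-ocr": ["--force-ocr"],   # force rasterizing of all content
    }
    if mode not in modes:
        raise ValueError(f"unknown mode: {mode!r}")
    return modes[mode]


def build_command(src, dst, mode="default"):
    """Return the argument list for one run with the chosen mode."""
    return ["ocrmypdf", *ocr_mode_args(mode), src, dst]
```

For example, build_command("mixed.pdf", "mixed_ocr.pdf", "skip-text") yields an argument list that can be handed to subprocess.run.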
Abbyy FineReader 12 is the author's recommendation for anyone who needs to create a PDF whose annotated text must be perfect. This program automatically recognizes text and then gives the user the opportunity to correct it. However, correcting text by hand is boring (and some might even say, soul destroying). FineReader is also very sophisticated at detecting document elements such as tables and converting them to spreadsheets. For example, if one needs to extract scientific data from a scanned image, turn to FineReader. Its OCR is much slower than OCRmyPDF, but probably more accurate.
Adobe Acrobat XI can perform OCR, but it is slower than OCRmyPDF and similar in accuracy.
OCRmyPDF uses Tesseract-OCR as its OCR engine, so it depends entirely on Tesseract for OCR quality. Tesseract gives good results for clear black and white scans with common fonts and normal font sizes, provided that the correct language is specified, the dictionary contains all the document's words, and the document is oriented in one direction, deskewed, and free of multi-orientation elements. Basically, the stars need to be aligned. If you can easily read a document yourself with no squinting or special effort, that is a good sign. OCRmyPDF and Tesseract do a good job on files like tests/resources/LinnSequencer.jpg.
OCRmyPDF is good at making documents searchable by identifying keywords within them. In a huge collection, even its ability to only occasionally find useful keywords can be helpful for search.
Unfortunately, in many files certain patterns will confuse Tesseract, and it will find gibberish and not filter this out. Maps are one example – legend markers and geographic features will be reinterpreted as letters. The --debug-output option reveals its findings for the curious.
If possible, OCRmyPDF will insert a text layer into your PDF and convert the result to PDF/A. PDF/A conversion may (probably does) transcode images to a standardized colorspace, which is what you want for long term archiving. Auto-rotation correction can be done without changing the quality of the image layer.
For some options and some PDF files, it will instead rasterize your PDF at the resolution of the highest quality image on a given page, perform OCR on the image, and then construct a new PDF based on the image and text. In this case, the resulting PDF could be larger than the input if multiple images are present, and vector content will be lost.
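To get a feel for why a rasterized page can be large, the arithmetic can be sketched as follows. This is purely illustrative and not OCRmyPDF's internal code: the function raster_size is hypothetical, and it assumes the page contains at least one image and that page dimensions are given in inches.

```python
def raster_size(page_width_in, page_height_in, image_dpis):
    """Return (dpi, width_px, height_px) for a page rasterized at the
    resolution of its highest-resolution image."""
    dpi = max(image_dpis)  # the highest quality image sets the resolution
    return dpi, round(page_width_in * dpi), round(page_height_in * dpi)


# A US Letter page holding one 150 DPI image and one 300 DPI image
dpi, width_px, height_px = raster_size(8.5, 11, [150, 300])
print(dpi, width_px, height_px)  # 300 2550 3300
```

In this example the whole page, including the area covered by the 150 DPI image, is rendered at 300 DPI.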
The UNIX principle of orthogonality: OCRmyPDF is really good at converting PDFs to OCR PDFs. It has one job, and aims to do that well. There are other tools we recommend that are really good at watching folders for changes and reacting to those changes (such as by launching an OCRmyPDF job): Gulp.js, watchmedo and watchman are some examples.
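The tools above are the recommended way to watch a folder. To show the general shape of "watch, then launch a job", here is a minimal sketch that swaps in simple polling with the Python standard library instead of any of those tools. The names new_pdfs and watch are hypothetical, and the handle callback is where an OCRmyPDF job would be launched.

```python
import time
from pathlib import Path


def new_pdfs(watch_dir, seen):
    """Return PDFs in watch_dir that were not seen before and record them."""
    fresh = [p for p in sorted(watch_dir.glob("*.pdf")) if p not in seen]
    seen.update(fresh)
    return fresh


def watch(watch_dir, handle, interval=5.0, max_polls=None):
    """Poll watch_dir and call handle(path) once for each new PDF.

    max_polls limits the number of polls; None means poll forever.
    """
    seen = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for pdf in new_pdfs(watch_dir, seen):
            handle(pdf)
        polls += 1
        if max_polls is None or polls < max_polls:
            time.sleep(interval)
```

Polling like this can pick up a file that is still being written, which is one reason a dedicated watching tool is usually the better choice.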