A simple benchmark of PDF extractor libraries in Python, based on how accurately they extract text from PDFs.
There are many PDF extractor libraries for Python, each with its own pros and cons, and choosing one can feel numbing. You can search Google for "the best Python library for extracting PDF" and get all sorts of results, but in the end it comes back to your needs. What exactly do you need? You need to extract information from the PDF, and that usually means the text. So which extraction library should you choose? Here I propose a way to benchmark the libraries: by checking the accuracy of the words extracted from the PDF source.
Python libraries I used for this project are:
- PyPDF
- PDFMiner
- PyMuPDF
- TextBlob (for spell checking)
All e-books used for benchmarking these libraries were collected from Project Gutenberg:
- A History of Rome to 565 A. D. by Arthur E. R. Boak
- Days of Heaven Upon Earth by A. B. Simpson
- Hidden Symbolism of Alchemy and the Occult Arts by Herbert Silberer
- Tempest and Sunshine by Mary Jane Holmes
- The Samurai Strategy by Thomas Hoover
I extracted the first ten pages from each source using PyPDF, PDFMiner, and PyMuPDF. I made a class that feeds each library's output to TextBlob and checks whether each word is spelled correctly or misspelled. I then counted the misspelled words and used that count to calculate the accuracy of each library on each source.
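The benchmarking class itself isn't shown here, so below is a minimal sketch of how such a pipeline could look. The helper names (`extract_first_pages`, `spelling_accuracy`) and the sample file name are mine, not from the original project, and it assumes pypdf, pdfminer.six, PyMuPDF, and TextBlob are installed (TextBlob also needs its corpora downloaded with `python -m textblob.download_corpora`).

```python
# Minimal sketch of the benchmark; helper names are illustrative, not the original class.
from pypdf import PdfReader
from pdfminer.high_level import extract_text as pdfminer_extract
import fitz  # PyMuPDF
from textblob import TextBlob, Word

N_PAGES = 10  # first ten pages of each source


def extract_first_pages(path):
    """Extract the first N_PAGES of a PDF with each of the three libraries."""
    results = {}

    # PyPDF (the pypdf package)
    reader = PdfReader(path)
    results["pypdf"] = " ".join(
        (reader.pages[i].extract_text() or "")
        for i in range(min(N_PAGES, len(reader.pages)))
    )

    # PDFMiner (pdfminer.six); page_numbers is zero-indexed
    results["pdfminer"] = pdfminer_extract(path, page_numbers=range(N_PAGES))

    # PyMuPDF
    with fitz.open(path) as doc:
        results["pymupdf"] = " ".join(
            doc[i].get_text() for i in range(min(N_PAGES, doc.page_count))
        )

    return results


def spelling_accuracy(text):
    """Share of extracted words that TextBlob considers correctly spelled."""
    words = TextBlob(text.lower()).words
    if not words:
        return 0.0
    # A word counts as misspelled if TextBlob's top spellcheck suggestion differs from it
    misspelled = sum(1 for w in words if Word(w).spellcheck()[0][0] != w)
    return (len(words) - misspelled) / len(words)


if __name__ == "__main__":
    # "tempest_and_sunshine.pdf" is a placeholder file name
    for library, text in extract_first_pages("tempest_and_sunshine.pdf").items():
        print(f"{library}: {spelling_accuracy(text):.2%} words spelled correctly")
```

Accuracy here is simply (total words - misspelled words) / total words, so a library that garbles or merges words during extraction scores lower.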