A simple benchmark of PDF extractor libraries in Python, based on how accurately they extract text from PDFs.
There are many PDF extractor libraries for Python, each with its own pros and cons, and choosing one can feel numbing. You can search Google for "the best Python library for extracting PDF" and get all sorts of results, but in the end it comes back to your needs. What exactly do you need? You need to extract information from the PDF, and that usually means the text. So which extraction library should you choose? Here I propose a way to benchmark the libraries: by checking the accuracy of the words extracted from the PDF source.
Python libraries I used for this project are:
- PyPDF
- PDFMiner
- PyMuPDF
- TextBlob (for spell checking)
All e-books used for benchmarking these libraries were collected from Project Gutenberg:
- A History of Rome to 565 A. D. by Arthur E. R. Boak
- Days of Heaven Upon Earth by A. B. Simpson
- Hidden Symbolism of Alchemy and the Occult Arts by Herbert Silberer
- Tempest and Sunshine by Mary Jane Holmes
- The Samurai Strategy by Thomas Hoover
I extracted the first ten pages from each source using PyPDF, PDFMiner, and PyMuPDF. I made a class that feeds each library's output to TextBlob and checks whether each word is spelled correctly or misspelled. I then counted the misspelled words and used that count to calculate the accuracy of each library on each source.
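The benchmarking class itself isn't shown here, so below is a minimal sketch of how such a pipeline could look. The helper names (`extract_first_pages`, `spelling_accuracy`) and the sample file name are mine, not from the original project, and it assumes pypdf, pdfminer.six, PyMuPDF, and TextBlob are installed (TextBlob also needs its corpora downloaded with `python -m textblob.download_corpora`).

```python
# Minimal sketch of the benchmark; helper names are illustrative, not the original class.
from pypdf import PdfReader
from pdfminer.high_level import extract_text as pdfminer_extract
import fitz  # PyMuPDF
from textblob import TextBlob, Word

N_PAGES = 10  # first ten pages of each source


def extract_first_pages(path):
    """Extract the first N_PAGES of a PDF with each of the three libraries."""
    results = {}

    # PyPDF (the pypdf package)
    reader = PdfReader(path)
    results["pypdf"] = " ".join(
        (reader.pages[i].extract_text() or "")
        for i in range(min(N_PAGES, len(reader.pages)))
    )

    # PDFMiner (pdfminer.six); page_numbers is zero-indexed
    results["pdfminer"] = pdfminer_extract(path, page_numbers=range(N_PAGES))

    # PyMuPDF
    with fitz.open(path) as doc:
        results["pymupdf"] = " ".join(
            doc[i].get_text() for i in range(min(N_PAGES, doc.page_count))
        )

    return results


def spelling_accuracy(text):
    """Share of extracted words that TextBlob considers correctly spelled."""
    words = TextBlob(text.lower()).words
    if not words:
        return 0.0
    # A word counts as misspelled if TextBlob's top spellcheck suggestion differs from it
    misspelled = sum(1 for w in words if Word(w).spellcheck()[0][0] != w)
    return (len(words) - misspelled) / len(words)


if __name__ == "__main__":
    # "tempest_and_sunshine.pdf" is a placeholder file name
    for library, text in extract_first_pages("tempest_and_sunshine.pdf").items():
        print(f"{library}: {spelling_accuracy(text):.2%} words spelled correctly")
```

Accuracy here is simply (total words - misspelled words) / total words, so a library that garbles or merges words during extraction scores lower.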