Comparison with other PDF Table Extraction libraries and tools

This wiki page compares Camelot's output, qualitatively, with that of other open-source libraries and tools. Chances are that you have already used one of the libraries or tools mentioned below, had problems getting the desired output, and are here to see whether Camelot can extract tables from your PDFs better.

We believe that Camelot works better than the other open-source alternatives. Still, we try to avoid bias and to be fair and accurate here by listing the advantages other tools might have over Camelot, along with the steps by which Camelot makes up for them using one or more of its configuration parameters.

We would like your help to keep this document up-to-date. If you notice any inconsistency, please let us know by opening an issue.

Table of contents

Tabula
pdfplumber
pdftables
pdf-table-extract

Tabula

The names of the parsing methods in Camelot (Lattice and Stream) were inspired by Tabula. Lattice is used to parse tables that have demarcated lines between cells, while Stream is used to parse tables that use whitespace between cells to simulate a table structure.
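
In Camelot's Python API, the parsing method is selected with the `flavor` argument of `camelot.read_pdf`. The sketch below assumes a recent Camelot release is installed. The helper names `flavor_for` and `extract_tables` are hypothetical and exist only for this illustration.

```python
def flavor_for(has_ruling_lines: bool) -> str:
    """Pick the parsing method: Lattice for ruled tables, Stream otherwise."""
    return "lattice" if has_ruling_lines else "stream"


def extract_tables(pdf_path: str, has_ruling_lines: bool, pages: str = "1"):
    """Parse tables from pdf_path with the flavor matching the table type."""
    # Imported lazily so flavor_for can be used without Camelot installed.
    import camelot

    return camelot.read_pdf(
        pdf_path, pages=pages, flavor=flavor_for(has_ruling_lines)
    )
```

Calling `extract_tables("ruled.pdf", has_ruling_lines=True)` (with a placeholder path) would return a list of tables. Each table exposes a pandas DataFrame as `.df` and can be written out with `.to_csv(path)`.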

We took 10 PDFs of each type (lines between cells for Lattice, whitespace between cells for Stream) and passed them through Tabula's web interface and Camelot's command-line interface. The CSV outputs were pushed to this repo as is. We found that Camelot's output was as good as or better than Tabula's in every Lattice case. Tabula does better table detection for Stream cases, but it still fails to give good parsing output in most of them, a problem that Camelot addresses with its configuration parameters.

Note: We have better table detection for Stream cases in the works. #102

We put a ✔️ in the "Table detected correctly?" column if the table was detected accurately and ❌ if it was not (providing an image of the detected table in both cases). The reasoning behind which output is better is provided in the "Comments" column.

Lattice

n	PDF	Notes	Table detected correctly?		Extra configuration used?		Result		Which has better output?	Comments
			Tabula	Camelot	Tabula	Camelot	Tabula	Camelot
1.	agstat.pdf	Header text is vertical, columns span multiple cells.	❌ image	✔️ image	NA	No	csv	csv	Camelot	Tabula doesn't output all the header text. Camelot gets all the headers in the correct cells, albeit in reverse order in some cases.
2.	background_lines_1.pdf	The lines are in background.	❌ image	✔️ image	NA	-back	csv	csv	Both
3.	background_lines_2.pdf	The lines are in background.	✔️ image	✔️ image	NA	-scale 40 -back	csv	csv	Camelot	Tabula shifts some of the data points towards the left. Camelot gets the table as is.
4.	column_span_1.pdf	Columns span multiple cells.	✔️ image	✔️ image	NA	No	csv	csv	Camelot	Tabula moves some headers on the top-right to the left. Camelot gets them in the correct cells.
5.	column_span_2.pdf	Columns span multiple cells.	✔️ image	✔️ image	NA	-scale 40	csv	csv	Camelot	Tabula shifts some of the data points towards the left. Camelot gets the table as is (for example, the number 1728).
6.	electoral_roll.pdf	Very unusual table.	✔️ (almost) image	✔️ image	NA	-scale 40 -I 1	csv	csv	Camelot	Tabula doesn't give an output. Camelot is able to get all the text out while preserving the table structure; the output is usable after some cleaning with pattern matching.
7.	rotated.pdf	The table is rotated counter-clockwise.	❌ image	✔️ image	NA	No	csv	csv	Camelot	Tabula's output is unusable; Camelot gets the table out as is.
8.	row_span_1.pdf	Rows span multiple cells.	✔️ image	✔️ image	NA	-scale 40 -block 99 -const -20	csv	csv	Camelot	Tabula shifts some of the data points towards the left. Camelot gets the table as is. Check out the totals near the bottom-right.
9.	twotables_1.pdf	There are two tables on a single page.	✔️ (almost) image	✔️ image	NA	No	csv	csv1 csv2	Camelot	Tabula's output is unusable; Camelot gets the tables out as they are.
10.	twotables_2.pdf	There are two tables on a single page.	✔️ image	✔️ image		No	csv1 csv2	csv1 csv2	Both
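
The entries in the "Extra configuration used?" column are command-line flags. The sketch below translates them into keyword arguments for `camelot.read_pdf`. The flag-to-argument mapping is an assumption based on the option names in Camelot's CLI and should be checked against the installed version. `flags_to_kwargs` is a hypothetical helper, not part of Camelot.

```python
# Assumed correspondence between the CLI flags in the table and
# camelot.read_pdf keyword arguments; verify against your Camelot version.
# A converter of None marks a boolean switch that takes no value.
LATTICE_FLAGS = {
    "-back": ("process_background", None),
    "-scale": ("line_scale", int),
    "-I": ("iterations", int),
    "-block": ("threshold_blocksize", int),
    "-const": ("threshold_constant", int),
}


def flags_to_kwargs(flag_string: str) -> dict:
    """Turn a flag string such as '-scale 40 -back' into keyword arguments."""
    tokens = flag_string.split()
    kwargs = {}
    i = 0
    while i < len(tokens):
        name, convert = LATTICE_FLAGS[tokens[i]]
        if convert is None:
            # Boolean switch: its presence alone enables the option.
            kwargs[name] = True
            i += 1
        else:
            # Valued option: the next token holds the value.
            kwargs[name] = convert(tokens[i + 1])
            i += 2
    return kwargs


print(flags_to_kwargs("-scale 40 -block 99 -const -20"))
# {'line_scale': 40, 'threshold_blocksize': 99, 'threshold_constant': -20}
```

The resulting dictionary could then be passed along as `camelot.read_pdf(path, flavor="lattice", **kwargs)`.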

Stream

n	PDF	Notes	Table detected correctly?		Extra configuration used?		Result		Which has better output?	Comments
			Tabula	Camelot	Tabula	Camelot	Tabula	Camelot
1.	12s0324.pdf	There are two tables on a single page.	✔️	NA	NA		csv1 csv2	csv1 csv2	Both
2.	birdisland.pdf	PDF is encrypted.	✔️	NA	NA		csv	csv	Tabula	Camelot detects two tables, and even though the structure is correct, duplicate strings are found in the same cells. Bug filed. #103.
3.	budget.pdf		✔️	NA	NA	No	csv	csv	Camelot	Tabula merges the last two columns into one; Camelot gets them correctly.
4.	district_health.pdf		✔️	NA	NA	No	csv	csv	Camelot	Tabula merges all the columns. Camelot assigns the data points to the correct cells.
5.	health.pdf		✔️	NA	NA	No	csv	csv	Camelot	Same as above.
6.	m27.pdf	The text is very closely spaced (difficult to differentiate between columns).	✔️	NA	NA	-C 72,95,209,327,442,529,566,606,683 -split	csv	csv	Camelot	Tabula merges some columns. Camelot uses its "-split" feature along with column separators to cut the text strings at those coordinates and put them in the correct cells.
7.	mexican_towns.pdf		✔️	NA	NA	No	csv	csv	Both
8.	missing_values.pdf	Two columns don't have any values.	✔️	NA	NA	No	csv	csv	Camelot	Tabula merges some columns; Camelot gets them correctly.
9.	population_growth.pdf		✔️	NA	NA	No	csv	csv	Both
10.	superscript.pdf	A number has another number in superscript. (Refer to the 2nd column of the row starting with Kerala.)	✔️	NA	NA	-flag	csv	csv	Camelot	Tabula merges the superscript with the number, which doesn't matter in this case because of the decimal point but would change the number by 10x without it. Camelot uses a configuration parameter to delimit the superscripts with <s></s> tags, so that they can be handled during cleaning.
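
Two comments in the Stream table describe mechanisms that are easier to see in code: cutting a text string at column separators (m27.pdf) and delimiting superscripts with `<s></s>` tags (superscript.pdf). The sketch below is a simplified illustration of both ideas and not Camelot's actual implementation. It assumes each character comes with an x-coordinate or a font size, and the names `layout`, `split_at_separators`, and `flag_small_text` are hypothetical.

```python
from bisect import bisect_right


def layout(text, x0, char_width):
    """Place each character of text at a fixed pitch, starting at x0."""
    return [(x0 + i * char_width, ch) for i, ch in enumerate(text)]


def split_at_separators(chars, separators):
    """Cut a run of (x, character) pairs into cells at sorted x separators."""
    cells = [""] * (len(separators) + 1)
    for x, ch in chars:
        # The number of separators at or left of x is the column index.
        cells[bisect_right(separators, x)] += ch
    return [cell.strip() for cell in cells]


def flag_small_text(chars):
    """Wrap (character, font_size) runs smaller than the largest size in <s></s>."""
    if not chars:
        return ""
    base = max(size for _, size in chars)
    out, flagged = [], False
    for ch, size in chars:
        small = size < base
        if small != flagged:
            # Open a tag when a small run starts, close it when it ends.
            out.append("<s>" if small else "</s>")
            flagged = small
        out.append(ch)
    if flagged:
        out.append("</s>")
    return "".join(out)


# One fused string whose characters straddle a separator at x = 88.
print(split_at_separators(layout("Kerala33.4", 60, 5), [88]))
# ['Kerala', '33.4']

# A value followed by a superscript set in a smaller font.
print(flag_small_text([("2", 10), ("4", 10), (".", 10), ("9", 10), ("1", 6)]))
# 24.9<s>1</s>
```

Without the tags, a value such as 249 followed by a superscript 1 would read as 2491, which is the roughly 10x error mentioned in the table.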

pdfplumber

Five PDFs of each type for which Camelot required no extra configuration were taken from the tables above. Tables from the selected PDFs were parsed using this script (which uses pdfplumber) and Camelot's command-line interface.
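
The script itself is not reproduced on this page. As a rough illustration only, a pdfplumber extraction of this kind might look like the sketch below. It assumes pdfplumber is installed and uses its `extract_tables` page method with default settings. The function names are hypothetical, and the script actually used for the comparison may differ.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize one extracted table (a list of rows) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # pdfplumber reports empty cells as None; write them as empty strings.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def extract_tables_with_pdfplumber(pdf_path):
    """Return the CSV text of every table pdfplumber finds in pdf_path."""
    # Imported lazily so rows_to_csv can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return [
            rows_to_csv(table)
            for page in pdf.pages
            for table in page.extract_tables()
        ]
```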

The reasoning behind which output is better is provided in the "Comments" column.

n	PDF	Notes	Result		Which has better output?	Comments
			pdfplumber	Camelot
1.	agstat.pdf	Header text is vertical, columns span multiple cells.	csv	csv	Camelot	pdfplumber garbles the header text.
2.	column_span_1.pdf	Columns span multiple cells.	csv	csv	Both
3.	rotated.pdf	The table is rotated counter-clockwise.	csv	csv	Camelot	pdfplumber output unusable.
4.	twotables_1.pdf	There are two tables on a single page.	csv	csv1 csv2	Camelot	pdfplumber doesn't identify two tables and output is unusable.
5.	twotables_2.pdf	There are two tables on a single page.	csv	csv1 csv2	Camelot	pdfplumber doesn't identify two tables and output is unusable.
6.	budget.pdf		errored	csv	Camelot
7.	district_health.pdf		csv	csv	Camelot	pdfplumber output unusable, merged columns.
8.	health.pdf		csv	csv	Camelot	pdfplumber output unusable, merged columns.
9.	mexican_towns.pdf		errored	csv	Camelot
10.	missing_values.pdf	Two columns don't have any values.	csv	csv	Camelot	pdfplumber output unusable, merged columns.

pdftables

Open-source development of pdftables stopped in September 2013, when it became a closed-source paid tool.

Again, five PDFs of each type for which Camelot required no extra configuration were taken from the tables above. Tables from the selected PDFs were parsed using this script (which uses pdftables) and Camelot's command-line interface.

Again, the reasoning behind which output is better is provided in the "Comments" column.

n	PDF	Notes	Result		Which has better output?	Comments
			pdftables	Camelot
1.	agstat.pdf	Header text is vertical, columns span multiple cells.	csv	csv	Camelot	pdftables output unusable, merged columns.
2.	column_span_1.pdf	Columns span multiple cells.	csv	csv	Camelot	pdftables output unusable, merged columns.
3.	rotated.pdf	The table is rotated counter-clockwise.	csv	csv	Camelot	pdftables output unusable.
4.	twotables_1.pdf	There are two tables on a single page.	csv	csv1 csv2	Camelot	pdftables doesn't combine multi-line rows.
5.	twotables_2.pdf	There are two tables on a single page.	csv	csv1 csv2	Camelot	pdftables output unusable, merged columns.
6.	budget.pdf		csv	csv	Camelot	pdftables output unusable, merged columns.
7.	district_health.pdf		csv	csv	Camelot	pdftables output unusable, merged columns.
8.	health.pdf		csv	csv	Camelot	pdftables output unusable, merged columns.
9.	mexican_towns.pdf		csv	csv	Both
10.	missing_values.pdf	Two columns don't have any values.	csv	csv	Camelot	pdftables output unusable, merged columns.

pdf-table-extract

Five PDFs of each type for which Camelot required no extra configuration were taken from the tables above. Tables from the selected PDFs were parsed using this script (which uses pdf-table-extract) and Camelot's command-line interface.

The reasoning behind which output is better is provided in the "Comments" column.

n	PDF	Notes	Result		Which has better output?	Comments
			pdf-table-extract (pte)	Camelot
1.	agstat.pdf	Header text is vertical, columns span multiple cells.	csv	csv	Both	Camelot puts vertical headers in reverse order. Bug filed. [#105]
2.	column_span_1.pdf	Columns span multiple cells.	csv	csv	Camelot	pte gives extra columns.
3.	rotated.pdf	The table is rotated counter-clockwise.	csv	csv	Camelot	pte doesn't account for table rotation.
4.	twotables_1.pdf	There are two tables on a single page.	csv	csv1 csv2	Camelot	pte output unusable.
5.	twotables_2.pdf	There are two tables on a single page.	csv	csv1 csv2	Camelot	pte detects one table and merges first row with header.
6.	budget.pdf		csv	csv	Camelot	pte output unusable.
7.	district_health.pdf		csv	csv	Camelot	pte output unusable.
8.	health.pdf		csv	csv	Camelot	pte output unusable.
9.	mexican_towns.pdf		csv	csv	Camelot	pte output unusable.
10.	missing_values.pdf	Two columns don't have any values.	csv	csv	Camelot	pte output unusable.
