Add the ability to extract and check tables in pdf files #6

VitalyShalaev · 2021-02-02T05:53:23Z

Many PDF files contain tables, but pdf-test only extracts text from pages. I would like a more convenient way to check the contents of tables.

asolntsev · 2021-02-04T22:03:11Z

@VitalyShalaev Thank you for the suggestion. How exactly do you want to find the tables?

VitalyShalaev · 2021-02-08T10:39:01Z

@asolntsev
So far I see two options:

1. Automatic detection of tables on the page, with the result returned as a list of lists.
2. Locating a table by the specified column names.
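
To make the two options concrete, here is a minimal sketch in plain Java. It assumes automatic detection (option 1) has already produced each table as a list of rows, where each row is a list of cell strings, and shows how a table could then be located by its column names (option 2). `TableLookup`, `findByColumns` and `normalize` are hypothetical names for illustration only; they are not part of pdf-test or of any PDF library, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper, not part of pdf-test: works on tables already extracted as lists of lists. */
class TableLookup {

  /** Option 2: return the first table whose header row matches the given column names. */
  static Optional<List<List<String>>> findByColumns(
      List<List<List<String>>> tables, List<String> columnNames) {
    List<String> expected = normalize(columnNames);
    for (List<List<String>> table : tables) {
      // Assumption: the first row of each extracted table is its header.
      if (!table.isEmpty() && normalize(table.get(0)).equals(expected)) {
        return Optional.of(table);
      }
    }
    return Optional.empty();
  }

  /** Trim cells and collapse whitespace runs so layout differences do not break matching. */
  static List<String> normalize(List<String> row) {
    List<String> result = new ArrayList<>();
    for (String cell : row) {
      result.add(cell.trim().replaceAll("\\s+", " "));
    }
    return result;
  }

  public static void main(String[] args) {
    // Option 1: what automatic detection might return for a page with two tables.
    List<List<List<String>>> tables = Arrays.asList(
        Arrays.asList(Arrays.asList("Item", "Qty"), Arrays.asList("Pen", "2")),
        Arrays.asList(Arrays.asList("Name", "Total  price"), Arrays.asList("Alice", "10.50")));

    List<List<String>> table = findByColumns(tables, Arrays.asList("Name", "Total price"))
        .orElseThrow(() -> new IllegalStateException("table not found"));
    System.out.println(table.get(1)); // prints [Alice, 10.50]
  }
}
```

Matching on a normalized header row is only one possible design. A real implementation would also have to decide how to handle tables without a header row or with headers that span several lines.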

VitalyShalaev · 2021-02-09T06:27:31Z

@asolntsev
For case 1:
I found the SpreadsheetExtractionAlgorithm in Tabula: https://mvnrepository.com/artifact/technology.tabula/tabula/1.0.4.
However, the text in the cells of some tables is not extracted correctly. I am trying to determine the cause, but I am not sure I will be able to.

VitalyShalaev · 2024-02-09T13:13:55Z

@asolntsev
For case 1:
I found a library that works well: https://mvnrepository.com/artifact/e-iceblue/spire.pdf/10.1.3
Example: https://medium.com/@alice.yang_10652/extract-table-data-from-pdf-in-java-8dc4fad73752

asolntsev added the "enhancement" and "help wanted" labels · 2021-12-03
