Add the ability to extract and check tables in pdf files #6

Open
VitalyShalaev opened this issue Feb 2, 2021 · 4 comments

Comments

@VitalyShalaev
Contributor

Many PDF files contain tables, but pdf-test only extracts plain text from pages. I would like a more convenient way to check the contents of tables.

@asolntsev
Member

@VitalyShalaev Thank you for the suggestion. How exactly do you want to find the tables?

@VitalyShalaev
Contributor Author

@asolntsev
So far I see 2 options:

  1. Detect tables on the page automatically and return the result as a list of lists.
  2. Locate a table by the specified column names.
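
To make the two options concrete, here is a minimal sketch of how option 2 could sit on top of option 1. It assumes the extraction step has already produced each table as a list of rows, with the header in the first row. `TableLookup`, `findByColumns`, and `column` are hypothetical names used only for illustration; they are not part of pdf-test or any other library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of pdf-test. */
class TableLookup {

    /**
     * Option 2: returns the first table whose header row (row 0) contains
     * every requested column name. Matching ignores case and surrounding spaces.
     * Each table is a list of rows, as produced by option 1.
     */
    static Optional<List<List<String>>> findByColumns(
            List<List<List<String>>> tables, List<String> columnNames) {
        List<String> wanted = normalize(columnNames);
        for (List<List<String>> table : tables) {
            if (!table.isEmpty() && normalize(table.get(0)).containsAll(wanted)) {
                return Optional.of(table);
            }
        }
        return Optional.empty();
    }

    /** Returns the trimmed cells below the header for the named column. */
    static List<String> column(List<List<String>> table, String columnName) {
        int index = normalize(table.get(0))
                .indexOf(columnName.trim().toLowerCase(Locale.ROOT));
        if (index < 0) {
            throw new IllegalArgumentException("No such column: " + columnName);
        }
        List<String> cells = new ArrayList<>();
        for (List<String> row : table.subList(1, table.size())) {
            // Short rows are padded with an empty string instead of failing.
            cells.add(index < row.size() ? row.get(index).trim() : "");
        }
        return cells;
    }

    /** Trims and lower-cases each cell so header comparison is forgiving. */
    private static List<String> normalize(List<String> cells) {
        List<String> result = new ArrayList<>();
        for (String cell : cells) {
            result.add(cell.trim().toLowerCase(Locale.ROOT));
        }
        return result;
    }
}
```

With something like this, a test could look up a table by its column names and then compare a single column against the expected values, instead of matching against the raw page text.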

@VitalyShalaev
Contributor Author

@asolntsev
For option 1:
I found the SpreadsheetExtractionAlgorithm in tabula: https://mvnrepository.com/artifact/technology.tabula/tabula/1.0.4.
However, it has problems recognizing the text in the cells of some tables. I am trying to find the cause, but I am not sure I will be able to.

@VitalyShalaev
Contributor Author

@asolntsev
For option 1:
I found a library that works well: https://mvnrepository.com/artifact/e-iceblue/spire.pdf/10.1.3
Example: https://medium.com/@alice.yang_10652/extract-table-data-from-pdf-in-java-8dc4fad73752
