[Monster PR] Upgrade to PDFBox 2.0 #150

Merged
merged 50 commits into from
Mar 27, 2017

jazzido
Contributor

@jazzido jazzido commented Mar 24, 2017

Now that @melisabok's fantastic work has been merged to a branch, it's time to review it and finally merge it to master.

I plan to merge and cut a new tabula-java release within the next few days. In the meantime, feel free to comment, test and play with it.

jazzido and others added 30 commits December 3, 2015 17:01
org.apache.pdfbox.examples.util.RemoveAllText
- Temporarily set height
… string

Add a test writer two tables for CSV output
# Conflicts:
#	src/test/resources/technology/tabula/json/schools.json
#	src/test/resources/technology/tabula/json/spanning_cells.json
#	src/test/resources/technology/tabula/json/spanning_cells_basic.json
#	src/test/resources/technology/tabula/json/twotables.json
@jazzido
Contributor Author

jazzido commented Mar 24, 2017

Those of you who have expressed interest in this (@gudipatiharitha, @subhashbylaiah, @beng06, @kapil-mangtani, @chezou): we're close to merging this to master, and we'd love some feedback before we do so.

@chezou
Contributor

chezou commented Mar 26, 2017

Thanks for your great work, @melisabok! All tests in tabula-py have passed with this code. I also confirmed that #114 is resolved :)

@jazzido jazzido merged commit f4c094e into master Mar 27, 2017
@jazzido
Contributor Author

jazzido commented Mar 27, 2017

Merged to master and updated version number to 1.0.0-SNAPSHOT.

@jazzido jazzido deleted the pdfbox2.0 branch April 6, 2017 01:28
@jeremybmerrill
Member

Awesome, super exciting work.

FYI, I made a couple of tiny tweaks to the Tabula GUI to support PDFBox 2.0 and the current tabula-java master branch (see https://github.com/tabulapdf/tabula/tree/pdfbox2 ). A basic test of the GUI works fine with these tweaks (and I imagine the rest does, I just haven't checked).

I don't want to merge them into Tabula's master until we're ready to do a release of tabula-java, but, just FYI, that's there in case you're experimenting with the GUI.

It's worth considering adding a pageCount method to ObjectExtractor in Java. It would be cleaner than the way I did it in Ruby, and users of the Java APIs and other language bindings would have a good reason to use it (without having to figure out what's going on with PDDocument).
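
To make the suggestion concrete, here is a minimal sketch of what such a delegating method could look like. It is only an illustration under stated assumptions: `PagedDocument` and `ExtractorSketch` are hypothetical stand-ins so the example compiles without PDFBox on the classpath, and they are not tabula-java's actual classes. In PDFBox 2.0 the underlying call would be `PDDocument.getNumberOfPages()`.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical stand-in for the single PDDocument capability used here.
// PDFBox 2.0's PDDocument exposes getNumberOfPages(); this interface exists
// only so the sketch is self-contained.
interface PagedDocument extends Closeable {
    int getNumberOfPages();
}

// Hypothetical, simplified extractor that owns a document.
// The real ObjectExtractor in tabula-java does much more than this.
class ExtractorSketch implements Closeable {
    private final PagedDocument document;

    ExtractorSketch(PagedDocument document) {
        this.document = document;
    }

    // Proposed convenience method: callers ask the extractor for the page
    // count instead of reaching into the underlying document themselves.
    int pageCount() {
        return document.getNumberOfPages();
    }

    // Closing the extractor closes the document it wraps.
    @Override
    public void close() throws IOException {
        document.close();
    }
}
```

With something along these lines, language bindings would only need to call `pageCount()` on the extractor and would never have to touch the PDFBox document type directly.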

@jazzido
Contributor Author

jazzido commented Apr 14, 2017 via email

@jeremybmerrill
Member

@jazzido Cool, I can handle that.

EmpowerZ pushed a commit to EmpowerZ/tabula-java that referenced this pull request Oct 23, 2020
* Starting with upgrade to PDFBox 2.0 (tabulapdf#52)

* 2.0

* little progress in upgrading to pdfbox 2

* upgrade to pdfbox 2 starting to show signs of life

* Fix TextElement creation

* fix tabs

* Use the code from LegacyPDFStreamEngine to create the TextElements

* Fix removeText function using the example:

org.apache.pdfbox.examples.util.RemoveAllText

* close the document

* close removed text document

* fix array serialization

* add spanning cells test with CSV format

* - Remove capheight calculation
- Temporarily set height

* Test writer two tables checking the json result object instead of the string

Add a test writer two tables for CSV output

* Fix pageTransform when there is a rotation
Add more csv tests

* fix path iterator

* update json tests

* update json outputs

* upgrade pdfbox version

* back to the old implementation and catch the IndexOutOfBoundsException

* Remove hardcoded code

* Remove more hardcoded code

* test all the elements of the detected table

* Change the expected table top value

* Increase the threshold factor to support greater headings

* Fix rectangle comparator.

* fix wrong expected column size, 5 instead of 6.

add more tests

* update expected table, more spaces are expected to respect the alignment.

* when the text value has length > 1, clean the spaces.

* clean code

* remove stack trace

* add log error

* upgrade all dependencies

* code formatting

* setting pom to snapshot version