[Monster PR] Upgrade to PDFBox 2.0 #150
# Conflicts: # src/test/resources/technology/tabula/json/schools.json # src/test/resources/technology/tabula/json/spanning_cells.json # src/test/resources/technology/tabula/json/spanning_cells_basic.json # src/test/resources/technology/tabula/json/twotables.json
add more tests
Those of you that have expressed interest in this (@gudipatiharitha, @subhashbylaiah, @beng06, @kapil-mangtani, @chezou), we're close to merging this to |
Thanks for your great work, @melisabok! All tests in tabula-py have passed with this code. I also confirmed #114 is resolved :)
Merged to |
Awesome, super exciting work. FYI, I made a couple of tiny tweaks to the tabula gui to support PDFBox 2.0 and the current tabula-java master branch: https://github.com/tabulapdf/tabula/tree/pdfbox2

A basic test of the GUI works fine with these tweaks (and I imagine the rest does, I just haven't checked). I don't want to merge them into Tabula's master until we're ready to do a release of tabula-java, but, just FYI, that's there in case you're experimenting with the GUI.

It's worth considering adding a |
Thanks for that. One additional thing that we should do before merging your branch to master is replace the current JPedal renderer with one that uses PDFBox (it renders PDFs perfectly now). I implemented one in http://GitHub.com/tabulapdf/tabula-web-java (we would just need to port it to Ruby).
@jazzido Cool, I can handle that.
* Starting with upgrade to PDFBox 2.0 (tabulapdf#52)
* 2.0
* little progress in upgrading to pdfbox 2
* upgrade to pdfbox 2 starting to show signs of life
* Fix TextElement creation
* fix tabs
* Use the code from LegacyPDFStreamEngine to create the TextElements
* Fix removeText function using the example: org.apache.pdfbox.examples.util.RemoveAllText
* close the document
* close removed text document
* fix array serialization
* add spanning cells test with CSV format
* - Remove capheight calculation - Temporally set height
* Test writer two tables checking the json result object instead of the string Add a test writer two tables for CSV output
* Fix pageTransform when there is a rotation Add more csv tests
* fix path iterator
* update json tests
* update json outputs
* upgrade pdfbox version
* back to the old implementation and catch the IndexOutOfBoundsException
* Remove hardcoded code
* Remove more hardcoded code
* test all the elements of the detected table
* Change the expected table top value
* Increase the threshold factor to support a greater headings
* Fix rectangle comparator.
* fix wrong expected column size, 5 instead of 6. add more tests
* update expected table, more spaces are expected to respect the alingment.
* when the text value has length > 1, clean the spaces.
* clean code
* remove stackstrace
* add log error
* upgrade all dependencies
* code formatting
* setting pom to snapshot version
Now that @melisabok's fantastic work has been merged to a branch, it's time to review it and finally merge it to master. I plan to merge and cut a new tabula-java release within the next few days. In the meantime, feel free to comment, test and play with it.