I want to parse a huge but regular document. Tabula correctly parses the first page, but using `--pages all` causes it to consume a huge amount of memory; I guess it tries to process every page up front before emitting any output, which would be the right thing to do for an arbitrary file.
It seems that running Tabula once per page and appending the output to a CSV would run much faster and use far fewer resources.
Considering that it easily goes beyond 8 GB of RAM (~5500 pages) and sends the JVM's GC into a frenzy, there should be an option to parse the pages individually and incrementally (printing to stdout as soon as possible).
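The per-page workaround described above could be sketched as a shell loop. The file names and the jar path are placeholders, and the page count would in practice come from a tool like `pdfinfo`; this is shown as a dry run that only prints the commands it would execute:

```shell
# Sketch: run Tabula once per page, appending each page's tables to one CSV,
# so memory stays bounded by a single page. "tabula.jar" and "huge.pdf" are
# placeholder names. Dry run: echo the commands instead of executing them.
pdf="huge.pdf"
out="tables.csv"
pages=3   # in practice: pdfinfo "$pdf" | awk '/^Pages:/ {print $2}'
for p in $(seq 1 "$pages"); do
  # real invocation would drop "echo" and redirect: ... >> "$out"
  echo java -jar tabula.jar --pages "$p" --format CSV "$pdf"
done
```

Each iteration starts a fresh JVM, so no state accumulates across pages; the trade-off is JVM startup cost on every page.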
That's something that we'd also like to implement. However, AFAICT, there's no way to avoid reading all the pages in a PDF with PDFBox 1.8. Until we finish the migration to PDFBox 2.0 (#52), I suggest that you split the PDF into smaller chunks (with pdftk, for example) and process them separately.
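The chunk-splitting suggested above could look like the following. The file name, total page count, and chunk size are placeholders; it is shown as a dry run that prints the `pdftk` commands rather than executing them:

```shell
# Sketch: split a large PDF into fixed-size chunks with pdftk's "cat"
# operation, so each chunk can be fed to Tabula separately. "huge.pdf",
# the page total, and the chunk size are placeholders. Dry run: echo only.
pdf="huge.pdf"
total=5500     # total page count of the document
chunk=500      # pages per chunk
i=0
for start in $(seq 1 "$chunk" "$total"); do
  i=$((i + 1))
  end=$((start + chunk - 1))
  [ "$end" -gt "$total" ] && end="$total"
  echo pdftk "$pdf" cat "$start-$end" output "chunk_$i.pdf"
done
```

The resulting `chunk_*.pdf` files can then be processed one at a time, keeping peak memory proportional to the chunk size instead of the whole document.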