Support for incremental output #113

Dietr1ch · 2016-10-23T23:41:33Z

I want to parse a huge but regular document. Tabula parses the first page correctly, but using --pages all makes it consume a huge amount of memory. My guess is that it processes every page before it starts emitting output, which would be the right thing to do for an arbitrary file.

It seems that running tabula once per page and appending the output to a CSV would run a lot faster and use far fewer resources.
Since it easily goes beyond 8 GB of RAM (~5500 pages) and makes the JVM's GC thrash, there should be an option to parse the pages individually and incrementally (printing to stdout as soon as possible).

jazzido · 2016-10-23T23:51:51Z

Hi @Dietr1ch,

That's something that we'd also like to implement. However, AFAICT, there's no way to avoid reading all the pages in a PDF with PDFBox 1.8. Until we finish the migration to PDFBox 2.0 (#52), I suggest that you split the PDF into smaller chunks (with pdftk, for example) and process them separately.
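
A minimal sketch of that workaround, assuming pdftk and a tabula jar are on hand. The helper names, the chunk size, and the chunk file names are made up for illustration and are not part of tabula; the total page count has to be supplied by the caller.

```python
import os
import subprocess
from typing import Iterator, List, Tuple


def page_chunks(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first, last) page ranges covering 1..total_pages."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    for first in range(1, total_pages + 1, chunk_size):
        yield first, min(first + chunk_size - 1, total_pages)


def split_command(pdf: str, first: int, last: int, out_pdf: str) -> List[str]:
    """Build the pdftk call that copies pages first..last into out_pdf."""
    return ["pdftk", pdf, "cat", f"{first}-{last}", "output", out_pdf]


def tabula_command(jar: str, pdf: str) -> List[str]:
    """Build the tabula call that extracts every page of a (small) PDF."""
    return ["java", "-jar", jar, "--pages", "all", pdf]


def extract_in_chunks(pdf: str, jar: str, total_pages: int,
                      out_csv: str, chunk_size: int = 100) -> None:
    """Split pdf into chunks and append tabula's output for each to out_csv."""
    with open(out_csv, "ab") as out:
        for first, last in page_chunks(total_pages, chunk_size):
            chunk_pdf = f"chunk-{first}-{last}.pdf"  # hypothetical file name
            subprocess.run(split_command(pdf, first, last, chunk_pdf), check=True)
            try:
                # Each JVM only ever sees chunk_size pages.
                subprocess.run(tabula_command(jar, chunk_pdf), stdout=out, check=True)
            finally:
                os.remove(chunk_pdf)
```

Something like `extract_in_chunks("doc.pdf", "tabula.jar", 5500, "doc.csv")` would then start one JVM per 100 pages. The chunk size is a trade-off between memory use and JVM start-up overhead, and the right value likely depends on the document.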

Dietr1ch · 2016-10-24T00:11:27Z

Thanks for the quick response.
I was trying this, but it also seems much slower than it should be.

for p in (seq (pdf-pageCount doc.pdf))
    java -jar target/tabula-0.9.1-jar-with-dependencies.jar  --pages=$p doc.pdf >> doc.csv
end

(pdf-pageCount just filters the output of pdfinfo)
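
The pdf-pageCount helper itself is not shown in this thread. A rough Python equivalent, assuming it reads the "Pages:" line that pdfinfo prints, might look like this (function names are illustrative):

```python
import re
import subprocess


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the number from the 'Pages:' line of pdfinfo's output."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def pdf_page_count(pdf: str) -> int:
    """Run pdfinfo on a file and return its page count."""
    result = subprocess.run(["pdfinfo", pdf], capture_output=True,
                            text=True, check=True)
    return parse_page_count(result.stdout)
```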

It seems that #52 blocks this.

jazzido · 2017-05-02T23:37:28Z

Leaving a comment to bump this issue. Now that #52 is closed, we should be able to implement this.
