Support for incremental output #113

Open
Dietr1ch opened this issue Oct 23, 2016 · 3 comments

Comments

@Dietr1ch

I want to parse a huge but regular document. Tabula parses the first page correctly, but using --pages all makes it consume a huge amount of memory. I guess it processes every page before it starts producing output, which would be the right thing to do for an arbitrary file.

It seems that running Tabula once per page and appending the output to a CSV would run a lot faster and use far fewer resources.
Since it easily goes beyond 8 GB of RAM (~5500 pages) and makes the JVM's GC thrash, there should be an option to parse the pages individually and incrementally, printing to stdout as soon as possible.
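
The behaviour requested here, emitting each page's rows as soon as they are ready instead of buffering the whole document, can be sketched roughly as follows. This is only an illustration of the idea in Python, not Tabula's code: `extract_page` is a hypothetical callback standing in for whatever produces the CSV rows of a single page.

```python
import csv
import io
from typing import Callable, Iterable, List, TextIO

Row = List[str]


def stream_pages(pages: Iterable[int],
                 extract_page: Callable[[int], List[Row]],
                 out: TextIO) -> int:
    """Write each page's rows to `out` as soon as that page is done.

    `extract_page` is a hypothetical stand-in for per-page table
    extraction; only one page's rows are held in memory at a time.
    Returns the number of rows written.
    """
    writer = csv.writer(out, lineterminator="\n")
    written = 0
    for page in pages:
        rows = extract_page(page)   # rows for this page only
        writer.writerows(rows)
        out.flush()                 # push the output out immediately
        written += len(rows)
    return written


# Example with a fake extractor that returns one row per page.
def fake_extract(page: int) -> List[Row]:
    return [[str(page), f"value-{page}"]]


buffer = io.StringIO()
total = stream_pages(range(1, 4), fake_extract, buffer)
```

Passing `sys.stdout` as `out` would give the "print as soon as possible" behaviour, with memory use bounded by the largest single page rather than by the whole document.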

@jazzido
Contributor

jazzido commented Oct 23, 2016

Hi @Dietr1ch,

That's something that we'd also like to implement. However, AFAICT, there's no way to avoid reading all the pages in a PDF with PDFBox 1.8. Until we finish the migration to PDFBox 2.0 (#52), I suggest that you split the PDF into smaller chunks (with pdftk, for example) and process them separately.
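
One way to script this workaround is to compute fixed-size page ranges and build one `pdftk <in> cat <start>-<end> output <out>` command per chunk. The sketch below only builds the argument lists and runs nothing. The chunk size of 100 pages and the `chunk-0001.pdf` naming are arbitrary assumptions for illustration, not something recommended in this thread.

```python
from typing import Iterator, List, Tuple


def chunk_ranges(total_pages: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (start, end) page ranges."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(1, total_pages + 1, chunk_size):
        yield start, min(start + chunk_size - 1, total_pages)


def pdftk_split_command(src: str, start: int, end: int, dest: str) -> List[str]:
    """Build the argv for extracting pages start..end of `src` into `dest`."""
    return ["pdftk", src, "cat", f"{start}-{end}", "output", dest]


# Assumed figures: 5500 pages (from the report above), 100 pages per chunk.
commands = [
    pdftk_split_command("doc.pdf", start, end, f"chunk-{index:04d}.pdf")
    for index, (start, end) in enumerate(chunk_ranges(5500, 100), start=1)
]
```

Each entry in `commands` can be handed to `subprocess.run(cmd, check=True)`, after which every chunk file is processed on its own.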

@Dietr1ch
Author

Thanks for the quick response.
I was trying this, but it also seems much slower than it should be.

for p in (seq (pdf-pageCount doc.pdf))
    java -jar target/tabula-0.9.1-jar-with-dependencies.jar --pages=$p doc.pdf >> doc.csv
end

(pdf-pageCount just filters the output of pdfinfo)
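
For reference, a helper of this kind can be sketched in Python. `pdfinfo` prints a `Pages:` line, which is all the parser looks for; the function names are made up for illustration. The last helper groups pages into ranges for `--pages`, on the assumption that this Tabula version accepts ranges such as `1-50`. Batching like this might cut down on JVM launches compared with one run per page, but that is untested here.

```python
import re
import subprocess
from typing import List


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from the text printed by `pdfinfo`."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def pdf_page_count(path: str) -> int:
    """Run `pdfinfo` on `path` (requires pdfinfo to be installed)."""
    result = subprocess.run(["pdfinfo", path],
                            capture_output=True, text=True, check=True)
    return parse_page_count(result.stdout)


def page_batches(total_pages: int, batch_size: int) -> List[str]:
    """Return page specs such as '1-50'; a lone page is written as '51'."""
    specs = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        specs.append(str(start) if start == end else f"{start}-{end}")
    return specs


# Parsing a made-up pdfinfo output; no external tool is run here.
sample = "Title:          Report\nPages:          5500\nEncrypted:      no\n"
count = parse_page_count(sample)
batches = page_batches(count, 50)
```

Each string in `batches` could then replace `$p` in the loop above, one Tabula run per batch instead of one per page.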

It seems that #52 blocks this.

@jazzido
Contributor

jazzido commented May 2, 2017

Leaving a comment to bump this issue. Now that #52 is closed, we should be able to implement this.
