-
Hello @mophilly! Sorry for the delay, I was busy with contract work. I'm going to take care of the rest now!
Hmm, that should not happen; it should be transparent as long as the content is not too complex.
Also, use Docling if possible, with vision enabled: everything gets added into the response and the results will be pretty much OK. If the completion strategies like paginate don't work, that's odd; it means the model is too complex. Don't worry, no money needed :)
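Something like this should work (a minimal sketch assuming ExtractThinker's `DocumentLoaderDocling` and `CompletionStrategy` names; exact import paths may differ by version, and `StatementContract` stands in for your model):

```python
from extract_thinker import Extractor, Contract, DocumentLoaderDocling
from extract_thinker.models.completion_strategy import CompletionStrategy

class StatementContract(Contract):
    ...  # your nested pydantic model goes here

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderDocling())  # Docling copes well with dense, layout-heavy PDFs
extractor.load_llm("gpt-4o")  # any vision-capable model

# vision=True sends the page images along with the text, and PAGINATE
# splits the long document into chunks and merges the per-page results
result = extractor.extract(
    "statement.pdf",
    StatementContract,
    vision=True,
    completion_strategy=CompletionStrategy.PAGINATE,
)
```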
-
I will try Docling. The model is rather complex: I used classes with sub-classes to follow the structure of the source documents, so the pydantic model mirrors that using references. That may need to be refactored.
The transaction detail table is a sparse matrix, creating a spreadsheet-like presentation of sale amounts, deductions, and adjustments.
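For illustration, the shape is roughly this (simplified; the field names are invented and the real model has more levels of nesting):

```python
from typing import List, Optional
from pydantic import BaseModel

class TransactionRow(BaseModel):
    # Sparse matrix: most cells are empty on any given row,
    # so every amount column is Optional.
    description: str
    sale_amount: Optional[float] = None
    deduction: Optional[float] = None
    adjustment: Optional[float] = None

class Section(BaseModel):
    title: str
    transactions: List[TransactionRow]

class Statement(BaseModel):
    document_id: str
    sections: List[Section]  # nested to follow the source document's structure
```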
-
I have several sample PDF documents that are dense financial statements. Each is over 60 pages long, and each page has a header that repeats the document IDs and column headings.
I have a script that correctly processes smaller documents with the same layout, two or three pages long. Well, I am still wrangling with retrieving some values in the correct position.
For the longer documents the LLM hits token limits, so I tried the completion strategies, paginate and concatenate. With those, pydantic validation raises a great many errors and aborts the process.
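For reference, the call is roughly this (simplified from my script; loader and parameter names follow ExtractThinker's API as I understand it, and `StatementContract` is a stand-in for my real nested model):

```python
from extract_thinker import Extractor, Contract, DocumentLoaderPyPdf
from extract_thinker.models.completion_strategy import CompletionStrategy

class StatementContract(Contract):
    ...  # nested classes mirroring the statement structure

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderPyPdf())
extractor.load_llm("gpt-4o")

# Fine on the two-to-three page samples; on the 60+ page statements
# both PAGINATE and CONCATENATE end in a flood of pydantic validation errors.
result = extractor.extract(
    "statement_60pp.pdf",
    StatementContract,
    completion_strategy=CompletionStrategy.PAGINATE,
)
```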
I am looking into increasing the limits with the AI providers, to see if that helps.
For now I am stuck. Any advice?
P.S. I am happy to buy you a coffee or send some $$ for one-on-one help.