-
Notifications
You must be signed in to change notification settings - Fork 15.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PyPDFLoader use url in metadata source if file is a web path #12092
PyPDFLoader use url in metadata source if file is a web path #12092
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi @123-fake-st thanks for the contribution! I think this has to be fixed at the blob level. I would rather avoid adding complexity to the parsers while they're simple since it's hard to deprecate it afterwards as users start to depend on these parameters.
hi @eyurtsev: thank you for the review and of course to you and the team in general for the work on this project. Your response makes sense and I had been stuck on the idea that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted all changes to langchain.document_loaders.parsers.pdf.PyPDFParser
and updated langchain.document_loaders.pdf.PyPDFLoader
to be able to create a langchain.document_loaders.blob_loaders.blob
that has pdf data and uses a web url as the path, when appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works!
…in-ai#12092) **Description:** Update `langchain.document_loaders.pdf.PyPDFLoader` to store url in metadata (instead of a temporary file path) if user provides a web path to a pdf - **Issue:** Related to langchain-ai#7034; the reporter on that issue submitted a PR updating `PyMuPDFParser` for this behavior, but it has unresolved merge issues as of 20 Oct 2023 langchain-ai#7077 - In addition to `PyPDFLoader` and `PyMuPDFParser`, these other classes in `langchain.document_loaders.pdf` exhibit similar behavior and could benefit from an update: `PyPDFium2Loader`, `PDFMinerLoader`, `PDFMinerPDFasHTMLLoader`, `PDFPlumberLoader` (I'm happy to contribute to some/all of that, including assisting with `PyMuPDFParser`, if my work is agreeable) - The root cause is that the underlying pdf parser classes, e.g. `langchain.document_loaders.parsers.pdf.PyPDFParser`, never receive information about the url; the parsers receive a `langchain.document_loaders.blob_loaders.blob`, which contains the pdf contents and local file path, but not the url - This update passes the web path directly to the parser since it's minimally invasive and doesn't require further changes to maintain existing behavior for local files... bigger picture, I'd consider extending `blob` so that extra information like this can be communicated, but that has much bigger implications on the codebase which I think warrants maintainer input - **Dependencies:** None ```python # old behavior >>> from langchain.document_loaders import PyPDFLoader >>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0} # new behavior >>> from langchain.document_loaders import PyPDFLoader >>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0} ```
…13274) - **Description:** Update 5 pdf document loaders in `langchain.document_loaders.pdf`, to store a url in the metadata (instead of a temporary, local file path) if the user provides a web path to a pdf: `PyPDFium2Loader`, `PDFMinerLoader`, `PDFMinerPDFasHTMLLoader`, `PyMuPDFLoader`, and `PDFPlumberLoader` were updated. - The updates follow the approach used to update `PyPDFLoader` for the same behavior in #12092 - The `PyMuPDFLoader` changes required additional work in updating `langchain.document_loaders.parsers.pdf.PyMuPDFParser` to be able to process either an `io.BufferedReader` (from local pdf) or `io.BytesIO` (from online pdf) - The `PDFMinerPDFasHTMLLoader` change used a simpler approach since the metadata is assigned by the loader and not the parser - **Issue:** Fixes #7034 - **Dependencies:** None ```python # PyPDFium2Loader example: # old behavior >>> from langchain.document_loaders import PyPDFium2Loader >>> loader = PyPDFium2Loader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': '/var/folders/7z/d5dt407n673drh1f5cm8spj40000gn/T/tmpm5oqa92f/tmp.pdf', 'page': 0} # new behavior >>> from langchain.document_loaders import PyPDFium2Loader >>> loader = PyPDFium2Loader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0} ```
Description: Update
langchain.document_loaders.pdf.PyPDFLoader
to store url in metadata (instead of a temporary file path) if user provides a web path to a pdfIssue: Related to Loading online PDFs gives temporary file path as source in metadata #7034; the reporter on that issue submitted a PR updating
PyMuPDFParser
for this behavior, but it has unresolved merge issues as of 20 Oct 2023 Changed the metadata source from temporary file location to original … #7077PyPDFLoader
andPyMuPDFParser
, these other classes inlangchain.document_loaders.pdf
exhibit similar behavior and could benefit from an update:PyPDFium2Loader
,PDFMinerLoader
,PDFMinerPDFasHTMLLoader
,PDFPlumberLoader
(I'm happy to contribute to some/all of that, including assisting withPyMuPDFParser
, if my work is agreeable)langchain.document_loaders.parsers.pdf.PyPDFParser
, never receive information about the url; the parsers receive alangchain.document_loaders.blob_loaders.blob
, which contains the pdf contents and local file path, but not the urlblob
so that extra information like this can be communicated, but that has much bigger implications on the codebase which I think warrants maintainer inputDependencies: None