PyPDFLoader use url in metadata source if file is a web path #12092

123-fake-st · 2023-10-20T23:06:34Z

Description: Update langchain.document_loaders.pdf.PyPDFLoader to store url in metadata (instead of a temporary file path) if user provides a web path to a pdf

Issue: Related to Loading online PDFs gives temporary file path as source in metadata #7034; the reporter on that issue submitted a PR updating PyMuPDFParser for this behavior, but it has unresolved merge issues as of 20 Oct 2023 Changed the metadata source from temporary file location to original … #7077
- In addition to PyPDFLoader and PyMuPDFParser, these other classes in langchain.document_loaders.pdf exhibit similar behavior and could benefit from an update: PyPDFium2Loader, PDFMinerLoader, PDFMinerPDFasHTMLLoader, PDFPlumberLoader (I'm happy to contribute to some/all of that, including assisting with PyMuPDFParser, if my work is agreeable)
- The root cause is that the underlying pdf parser classes, e.g. langchain.document_loaders.parsers.pdf.PyPDFParser, never receive information about the url; the parsers receive a langchain.document_loaders.blob_loaders.blob, which contains the pdf contents and local file path, but not the url
- This update passes the web path directly to the parser since it's minimally invasive and doesn't require further changes to maintain existing behavior for local files... bigger picture, I'd consider extending blob so that extra information like this can be communicated, but that has much bigger implications on the codebase which I think warrants maintainer input
Dependencies: None

# old behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0}

# new behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0}

vercel · 2023-10-20T23:06:41Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
langchain	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Oct 27, 2023 5:08pm

eyurtsev

hi @123-fake-st thanks for the contribution! I think this has to be fixed at the blob level. I would rather avoid adding complexity to the parsers while they're simple since it's hard to deprecate it afterwards as users start to depend on these parameters.

… web path

123-fake-st · 2023-10-27T16:56:50Z

hi @eyurtsev: thank you for the review and of course to you and the team in general for the work on this project. Your response makes sense and I had been stuck on the idea that blob and/or Blob.from_path() would need to be modified to allow web path information to be passed in. This update reverts changes to the pdf parser and limits code changes to PyPDFLoader only, while retaining the desired new behavior.

123-fake-st

Reverted all changes to langchain.document_loaders.parsers.pdf.PyPDFParser and updated langchain.document_loaders.pdf.PyPDFLoader to be able to create a langchain.document_loaders.blob_loaders.blob that has pdf data and uses a web url as the path, when appropriate.

eyurtsev

This works!

…in-ai#12092) **Description:** Update `langchain.document_loaders.pdf.PyPDFLoader` to store url in metadata (instead of a temporary file path) if user provides a web path to a pdf - **Issue:** Related to langchain-ai#7034; the reporter on that issue submitted a PR updating `PyMuPDFParser` for this behavior, but it has unresolved merge issues as of 20 Oct 2023 langchain-ai#7077 - In addition to `PyPDFLoader` and `PyMuPDFParser`, these other classes in `langchain.document_loaders.pdf` exhibit similar behavior and could benefit from an update: `PyPDFium2Loader`, `PDFMinerLoader`, `PDFMinerPDFasHTMLLoader`, `PDFPlumberLoader` (I'm happy to contribute to some/all of that, including assisting with `PyMuPDFParser`, if my work is agreeable) - The root cause is that the underlying pdf parser classes, e.g. `langchain.document_loaders.parsers.pdf.PyPDFParser`, never receive information about the url; the parsers receive a `langchain.document_loaders.blob_loaders.blob`, which contains the pdf contents and local file path, but not the url - This update passes the web path directly to the parser since it's minimally invasive and doesn't require further changes to maintain existing behavior for local files... bigger picture, I'd consider extending `blob` so that extra information like this can be communicated, but that has much bigger implications on the codebase which I think warrants maintainer input - **Dependencies:** None ```python # old behavior >>> from langchain.document_loaders import PyPDFLoader >>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0} # new behavior >>> from langchain.document_loaders import PyPDFLoader >>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0} ```

…13274) - **Description:** Update 5 pdf document loaders in `langchain.document_loaders.pdf`, to store a url in the metadata (instead of a temporary, local file path) if the user provides a web path to a pdf: `PyPDFium2Loader`, `PDFMinerLoader`, `PDFMinerPDFasHTMLLoader`, `PyMuPDFLoader`, and `PDFPlumberLoader` were updated. - The updates follow the approach used to update `PyPDFLoader` for the same behavior in #12092 - The `PyMuPDFLoader` changes required additional work in updating `langchain.document_loaders.parsers.pdf.PyMuPDFParser` to be able to process either an `io.BufferedReader` (from local pdf) or `io.BytesIO` (from online pdf) - The `PDFMinerPDFasHTMLLoader` change used a simpler approach since the metadata is assigned by the loader and not the parser - **Issue:** Fixes #7034 - **Dependencies:** None ```python # PyPDFium2Loader example: # old behavior >>> from langchain.document_loaders import PyPDFium2Loader >>> loader = PyPDFium2Loader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': '/var/folders/7z/d5dt407n673drh1f5cm8spj40000gn/T/tmpm5oqa92f/tmp.pdf', 'page': 0} # new behavior >>> from langchain.document_loaders import PyPDFium2Loader >>> loader = PyPDFium2Loader('https://arxiv.org/pdf/1706.03762.pdf') >>> docs = loader.load() >>> docs[0].metadata {'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0} ```

PyPDFLoader use url in metadata source if file is a web path

99da3d5

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 20, 2023

baskaryan requested a review from eyurtsev October 23, 2023 19:32

baskaryan assigned eyurtsev Oct 23, 2023

eyurtsev requested changes Oct 27, 2023

View reviewed changes

Merge branch 'langchain-ai:master' into feature/pypdfloader_web_path

5ce64ff

vercel bot deployed to Preview October 27, 2023 13:36 View deployment

123-fake-st added 2 commits October 27, 2023 11:34

Merge branch 'langchain-ai:master' into feature/pypdfloader_web_path

162de30

have PyPDFLoader store url in metadata source via blob when file is a…

eb95204

… web path

Merge branch 'master' into feature/pypdfloader_web_path

2c1176b

vercel bot deployed to Preview October 27, 2023 17:08 View deployment

123-fake-st commented Oct 27, 2023

View reviewed changes

123-fake-st requested a review from eyurtsev October 27, 2023 18:54

eyurtsev approved these changes Nov 1, 2023

View reviewed changes

eyurtsev added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Nov 1, 2023

eyurtsev approved these changes Nov 1, 2023

View reviewed changes

eyurtsev merged commit 8bd3ce5 into langchain-ai:master Nov 1, 2023
21 checks passed

123-fake-st mentioned this pull request Nov 13, 2023

update pdf document loaders' metadata source to url for online pdf #13274

Merged

123-fake-st deleted the feature/pypdfloader_web_path branch December 1, 2023 23:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PyPDFLoader use url in metadata source if file is a web path #12092

PyPDFLoader use url in metadata source if file is a web path #12092

123-fake-st commented Oct 20, 2023

vercel bot commented Oct 20, 2023 •

edited

Loading

eyurtsev left a comment

123-fake-st commented Oct 27, 2023

123-fake-st left a comment

eyurtsev left a comment

PyPDFLoader use url in metadata source if file is a web path #12092

PyPDFLoader use url in metadata source if file is a web path #12092

Conversation

123-fake-st commented Oct 20, 2023

vercel bot commented Oct 20, 2023 • edited Loading

eyurtsev left a comment

Choose a reason for hiding this comment

123-fake-st commented Oct 27, 2023

123-fake-st left a comment

Choose a reason for hiding this comment

eyurtsev left a comment

Choose a reason for hiding this comment

vercel bot commented Oct 20, 2023 •

edited

Loading