Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyPDFLoader use url in metadata source if file is a web path #12092

Merged

Conversation

123-fake-st
Copy link
Contributor

Description: Update langchain.document_loaders.pdf.PyPDFLoader to store url in metadata (instead of a temporary file path) if user provides a web path to a pdf

  • Issue: Related to Loading online PDFs gives temporary file path as source in metadata #7034; the reporter on that issue submitted a PR updating PyMuPDFParser for this behavior, but it has unresolved merge issues as of 20 Oct 2023 Changed the metadata source from temporary file location to original … #7077

    • In addition to PyPDFLoader and PyMuPDFParser, these other classes in langchain.document_loaders.pdf exhibit similar behavior and could benefit from an update: PyPDFium2Loader, PDFMinerLoader, PDFMinerPDFasHTMLLoader, PDFPlumberLoader (I'm happy to contribute to some/all of that, including assisting with PyMuPDFParser, if my work is agreeable)
    • The root cause is that the underlying pdf parser classes, e.g. langchain.document_loaders.parsers.pdf.PyPDFParser, never receive information about the url; the parsers receive a langchain.document_loaders.blob_loaders.blob, which contains the pdf contents and local file path, but not the url
    • This update passes the web path directly to the parser since it's minimally invasive and doesn't require further changes to maintain existing behavior for local files... bigger picture, I'd consider extending blob so that extra information like this can be communicated, but that has much bigger implications on the codebase which I think warrants maintainer input
  • Dependencies: None

# old behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0}

# new behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0}

@vercel
Copy link

vercel bot commented Oct 20, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 27, 2023 5:08pm

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Oct 20, 2023
@baskaryan baskaryan requested a review from eyurtsev October 23, 2023 19:32
Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi @123-fake-st thanks for the contribution! I think this has to be fixed at the blob level. I would rather avoid adding complexity to the parsers while they're simple since it's hard to deprecate it afterwards as users start to depend on these parameters.

@123-fake-st
Copy link
Contributor Author

hi @eyurtsev: thank you for the review and of course to you and the team in general for the work on this project. Your response makes sense and I had been stuck on the idea that blob and/or Blob.from_path() would need to be modified to allow web path information to be passed in. This update reverts changes to the pdf parser and limits code changes to PyPDFLoader only, while retaining the desired new behavior.

Copy link
Contributor Author

@123-fake-st 123-fake-st left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted all changes to langchain.document_loaders.parsers.pdf.PyPDFParser and updated langchain.document_loaders.pdf.PyPDFLoader to be able to create a langchain.document_loaders.blob_loaders.blob that has pdf data and uses a web url as the path, when appropriate.

@123-fake-st 123-fake-st requested a review from eyurtsev October 27, 2023 18:54
Copy link
Collaborator

@eyurtsev eyurtsev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works!

@eyurtsev eyurtsev added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Nov 1, 2023
@eyurtsev eyurtsev merged commit 8bd3ce5 into langchain-ai:master Nov 1, 2023
21 checks passed
xieqihui pushed a commit to xieqihui/langchain that referenced this pull request Nov 21, 2023
…in-ai#12092)

**Description:** Update `langchain.document_loaders.pdf.PyPDFLoader` to
store url in metadata (instead of a temporary file path) if user
provides a web path to a pdf

- **Issue:** Related to langchain-ai#7034; the reporter on that issue submitted a PR
updating `PyMuPDFParser` for this behavior, but it has unresolved merge
issues as of 20 Oct 2023 langchain-ai#7077
- In addition to `PyPDFLoader` and `PyMuPDFParser`, these other classes
in `langchain.document_loaders.pdf` exhibit similar behavior and could
benefit from an update: `PyPDFium2Loader`, `PDFMinerLoader`,
`PDFMinerPDFasHTMLLoader`, `PDFPlumberLoader` (I'm happy to contribute
to some/all of that, including assisting with `PyMuPDFParser`, if my
work is agreeable)
- The root cause is that the underlying pdf parser classes, e.g.
`langchain.document_loaders.parsers.pdf.PyPDFParser`, never receive
information about the url; the parsers receive a
`langchain.document_loaders.blob_loaders.blob`, which contains the pdf
contents and local file path, but not the url
- This update passes the web path directly to the parser since it's
minimally invasive and doesn't require further changes to maintain
existing behavior for local files... bigger picture, I'd consider
extending `blob` so that extra information like this can be
communicated, but that has much bigger implications on the codebase
which I think warrants maintainer input

  - **Dependencies:** None

```python
# old behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0}

# new behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0}
```
hwchase17 pushed a commit that referenced this pull request Nov 29, 2023
…13274)

- **Description:** Update 5 pdf document loaders in
`langchain.document_loaders.pdf`, to store a url in the metadata
(instead of a temporary, local file path) if the user provides a web
path to a pdf: `PyPDFium2Loader`, `PDFMinerLoader`,
`PDFMinerPDFasHTMLLoader`, `PyMuPDFLoader`, and `PDFPlumberLoader` were
updated.
- The updates follow the approach used to update `PyPDFLoader` for the
same behavior in #12092
- The `PyMuPDFLoader` changes required additional work in updating
`langchain.document_loaders.parsers.pdf.PyMuPDFParser` to be able to
process either an `io.BufferedReader` (from local pdf) or `io.BytesIO`
(from online pdf)
- The `PDFMinerPDFasHTMLLoader` change used a simpler approach since the
metadata is assigned by the loader and not the parser
  - **Issue:** Fixes #7034
  - **Dependencies:** None


```python
# PyPDFium2Loader example:
# old behavior
>>> from langchain.document_loaders import PyPDFium2Loader
>>> loader = PyPDFium2Loader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': '/var/folders/7z/d5dt407n673drh1f5cm8spj40000gn/T/tmpm5oqa92f/tmp.pdf', 'page': 0}

# new behavior
>>> from langchain.document_loaders import PyPDFium2Loader
>>> loader = PyPDFium2Loader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0}
```
@123-fake-st 123-fake-st deleted the feature/pypdfloader_web_path branch December 1, 2023 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants