Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changed the metadata source from temporary file location to original … #7077

Closed
wants to merge 4 commits into from

Conversation

pranavpandey2511
Copy link

Changed the metadata source from temporary file location to original url/path in PyMuPDf
Issue: #7034
@rlancemartin , @eyurtsev

@vercel
Copy link

vercel bot commented Jul 3, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
langchain ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 12, 2023 2:24am

@dosubot dosubot bot added the 🤖:nit Small modifications/deletions, fixes, deps or improvements to existing code or docs label Jul 3, 2023
@vercel vercel bot temporarily deployed to Preview July 3, 2023 11:34 Inactive
Copy link
Contributor

@hwchase17 hwchase17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these are the same:

    @property
    def source(self) -> Optional[str]:
        """The source location of the blob as string if known otherwise none."""
        return str(self.path) if self.path else None

@pranavpandey2511
Copy link
Author

It was indeed same, original web_path is present is set as the source value in BasePDFLoader, set the blob source to that.

@vercel vercel bot temporarily deployed to Preview July 6, 2023 17:02 Inactive
@vercel vercel bot temporarily deployed to Preview July 12, 2023 02:24 Inactive
@leo-gan
Copy link
Collaborator

leo-gan commented Sep 18, 2023

@pranavpandey2511 Hi , could you, please, resolve the merging issues? After that ping me and I push this PR for the review. Thanks!

eyurtsev pushed a commit that referenced this pull request Nov 1, 2023
**Description:** Update `langchain.document_loaders.pdf.PyPDFLoader` to
store url in metadata (instead of a temporary file path) if user
provides a web path to a pdf

- **Issue:** Related to #7034; the reporter on that issue submitted a PR
updating `PyMuPDFParser` for this behavior, but it has unresolved merge
issues as of 20 Oct 2023 #7077
- In addition to `PyPDFLoader` and `PyMuPDFParser`, these other classes
in `langchain.document_loaders.pdf` exhibit similar behavior and could
benefit from an update: `PyPDFium2Loader`, `PDFMinerLoader`,
`PDFMinerPDFasHTMLLoader`, `PDFPlumberLoader` (I'm happy to contribute
to some/all of that, including assisting with `PyMuPDFParser`, if my
work is agreeable)
- The root cause is that the underlying pdf parser classes, e.g.
`langchain.document_loaders.parsers.pdf.PyPDFParser`, never receive
information about the url; the parsers receive a
`langchain.document_loaders.blob_loaders.blob`, which contains the pdf
contents and local file path, but not the url
- This update passes the web path directly to the parser since it's
minimally invasive and doesn't require further changes to maintain
existing behavior for local files... bigger picture, I'd consider
extending `blob` so that extra information like this can be
communicated, but that has much bigger implications on the codebase
which I think warrants maintainer input

  - **Dependencies:** None

```python
# old behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0}

# new behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0}
```
@efriis
Copy link
Member

efriis commented Nov 7, 2023

Closing because the PR wouldn't line up with the current directory structure of the library (would need to be in /libs/langchain/langchain instead of /langchain). Feel free to reopen against the current head if it's still relevant!

@efriis efriis closed this Nov 7, 2023
xieqihui pushed a commit to xieqihui/langchain that referenced this pull request Nov 21, 2023
…in-ai#12092)

**Description:** Update `langchain.document_loaders.pdf.PyPDFLoader` to
store url in metadata (instead of a temporary file path) if user
provides a web path to a pdf

- **Issue:** Related to langchain-ai#7034; the reporter on that issue submitted a PR
updating `PyMuPDFParser` for this behavior, but it has unresolved merge
issues as of 20 Oct 2023 langchain-ai#7077
- In addition to `PyPDFLoader` and `PyMuPDFParser`, these other classes
in `langchain.document_loaders.pdf` exhibit similar behavior and could
benefit from an update: `PyPDFium2Loader`, `PDFMinerLoader`,
`PDFMinerPDFasHTMLLoader`, `PDFPlumberLoader` (I'm happy to contribute
to some/all of that, including assisting with `PyMuPDFParser`, if my
work is agreeable)
- The root cause is that the underlying pdf parser classes, e.g.
`langchain.document_loaders.parsers.pdf.PyPDFParser`, never receive
information about the url; the parsers receive a
`langchain.document_loaders.blob_loaders.blob`, which contains the pdf
contents and local file path, but not the url
- This update passes the web path directly to the parser since it's
minimally invasive and doesn't require further changes to maintain
existing behavior for local files... bigger picture, I'd consider
extending `blob` so that extra information like this can be
communicated, but that has much bigger implications on the codebase
which I think warrants maintainer input

  - **Dependencies:** None

```python
# old behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0}

# new behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0}
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🤖:nit Small modifications/deletions, fixes, deps or improvements to existing code or docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants