Changed the metadata source from temporary file location to original … #7077

pranavpandey2511 · 2023-07-03T11:20:07Z

Changed the metadata source from the temporary file location to the original url/path in PyMuPDF
Issue: #7034
@rlancemartin , @eyurtsev

hwchase17

I think these are the same:

    @property
    def source(self) -> Optional[str]:
        """The source location of the blob as string if known otherwise none."""
        return str(self.path) if self.path else None

pranavpandey2511 · 2023-07-06T16:56:58Z

They were indeed the same. The original `web_path` is available as the source value in `BasePDFLoader`, so I set the blob source to that.
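A minimal sketch of the idea discussed in this exchange, not the actual patch. `SketchBlob` and its `source_override` field are hypothetical stand-ins and are not part of langchain. The sketch only illustrates why a blob built from a temporary file reports the temp path as its source unless something else is set explicitly.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SketchBlob:
    """Hypothetical stand-in for a blob; not the langchain class."""

    path: Optional[Path] = None
    # Hypothetical field: an explicit source that takes priority over the path.
    source_override: Optional[str] = None

    @property
    def source(self) -> Optional[str]:
        """Return the explicit source if set, else the path as a string, else None."""
        if self.source_override is not None:
            return self.source_override
        return str(self.path) if self.path else None


# Illustrative values only: a temp download location and the URL it came from.
temp_path = Path("/tmp/tmpabc/tmp.pdf")
web_path = "https://arxiv.org/pdf/1706.03762.pdf"

plain = SketchBlob(path=temp_path)
with_url = SketchBlob(path=temp_path, source_override=web_path)

print(plain.source)     # the temporary file path
print(with_url.source)  # the original URL
```

With only a path, the source falls back to the temp file location, which is the behavior reported in the issue. Carrying the original location alongside the path is one way to keep it visible downstream.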

leo-gan · 2023-09-18T23:09:18Z

@pranavpandey2511 Hi, could you please resolve the merge issues? After that, ping me and I will push this PR for review. Thanks!

**Description:** Update `langchain.document_loaders.pdf.PyPDFLoader` to store url in metadata (instead of a temporary file path) if user provides a web path to a pdf

- **Issue:** Related to #7034; the reporter on that issue submitted a PR updating `PyMuPDFParser` for this behavior, but it has unresolved merge issues as of 20 Oct 2023 #7077
- In addition to `PyPDFLoader` and `PyMuPDFParser`, these other classes in `langchain.document_loaders.pdf` exhibit similar behavior and could benefit from an update: `PyPDFium2Loader`, `PDFMinerLoader`, `PDFMinerPDFasHTMLLoader`, `PDFPlumberLoader` (I'm happy to contribute to some/all of that, including assisting with `PyMuPDFParser`, if my work is agreeable)
- The root cause is that the underlying pdf parser classes, e.g. `langchain.document_loaders.parsers.pdf.PyPDFParser`, never receive information about the url; the parsers receive a `langchain.document_loaders.blob_loaders.blob`, which contains the pdf contents and local file path, but not the url
- This update passes the web path directly to the parser since it's minimally invasive and doesn't require further changes to maintain existing behavior for local files... bigger picture, I'd consider extending `blob` so that extra information like this can be communicated, but that has much bigger implications on the codebase which I think warrants maintainer input
- **Dependencies:** None

```python
# old behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': '/var/folders/w2/zx77z1cs01s1thx5dhshkd58h3jtrv/T/tmpfgrorsi5/tmp.pdf', 'page': 0}

# new behavior
>>> from langchain.document_loaders import PyPDFLoader
>>> loader = PyPDFLoader('https://arxiv.org/pdf/1706.03762.pdf')
>>> docs = loader.load()
>>> docs[0].metadata
{'source': 'https://arxiv.org/pdf/1706.03762.pdf', 'page': 0}
```
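The "pass the web path directly to the parser" approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from #12092: `TempFileBlob`, `SketchPdfParser`, and the `source_override` parameter are hypothetical names, and the "pages" are plain strings rather than parsed PDF content.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class TempFileBlob:
    """Hypothetical blob: knows its contents and local path, but not the URL."""

    pages: List[str]
    path: str


class SketchPdfParser:
    """Hypothetical parser that can be told which source to record in metadata."""

    def __init__(self, source_override: Optional[str] = None) -> None:
        # When None, the parser falls back to the blob's local path,
        # which keeps the behavior for local files unchanged.
        self.source_override = source_override

    def parse(self, blob: TempFileBlob) -> List[Dict[str, Union[str, int]]]:
        source = self.source_override or blob.path
        return [{"source": source, "page": i} for i, _ in enumerate(blob.pages)]


# Illustrative values only.
blob = TempFileBlob(pages=["page one", "page two"], path="/tmp/tmpabc/tmp.pdf")
url = "https://arxiv.org/pdf/1706.03762.pdf"

local_meta = SketchPdfParser().parse(blob)
web_meta = SketchPdfParser(source_override=url).parse(blob)

print(local_meta[0])
print(web_meta[0])
```

The design trade-off mirrors the one named in the description: an optional argument on the parser is a small, local change, whereas letting the blob itself carry the URL would touch every place that constructs or consumes blobs.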

efriis · 2023-11-07T03:47:38Z

Closing because the PR wouldn't line up with the current directory structure of the library (would need to be in /libs/langchain/langchain instead of /langchain). Feel free to reopen against the current head if it's still relevant!

Changed the metadata source from temporary file location to original …

396e62d

…url/path in PyMuPDf

dosubot bot added the 🤖:nit Small modifications/deletions, fixes, deps or improvements to existing code or docs label Jul 3, 2023

hwchase17 reviewed Jul 5, 2023

Added original file source in metadata

da68599

baskaryan added 2 commits July 11, 2023 22:14

Merge branch 'master' into pranavpandey2511/master

ed602c7

fmt

aff899f

123-fake-st mentioned this pull request Oct 20, 2023

PyPDFLoader use url in metadata source if file is a web path #12092

Merged

efriis closed this Nov 7, 2023
