UnstructuredHTMLLoader fail when given Path type document #29090

Closed
5 tasks done
Marsman1996 opened this issue Jan 8, 2025 · 0 comments · Fixed by #29091
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@Marsman1996
Contributor

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

from pathlib import Path
from langchain_community.document_loaders import UnstructuredHTMLLoader

document = Path("./test.html")
loader = UnstructuredHTMLLoader(document, mode="elements", strategy="fast")
loader.load()

Error Message and Stack Trace (if applicable)

Traceback (most recent call last):
  File "/home/Marsman1996/afgen/test/./hello.py", line 6, in <module>
    loader.load()
    ~~~~~~~~~~~^^
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/langchain_community/document_loaders/unstructured.py", line 107, in lazy_load
    elements = self._get_elements()
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/langchain_community/document_loaders/html.py", line 33, in _get_elements
    return partition_html(filename=self.file_path, **self.unstructured_kwargs)  # type: ignore[arg-type]
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/partition/common/metadata.py", line 162, in wrapper
    elements = func(*args, **kwargs)
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/partition/html/partition.py", line 91, in partition_html
    return list(_HtmlPartitioner.iter_elements(opts))
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/partition/html/partition.py", line 189, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/partition/html/partition.py", line 203, in _iter_elements
    e.metadata.last_modified = self._opts.last_modified
                               ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/utils.py", line 154, in __get__
    value = self._fget(obj)
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/partition/html/partition.py", line 160, in last_modified
    if not self._file_path or is_temp_file_path(self._file_path)
                              ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/home/Marsman1996/afgen/test/.venv/lib/python3.13/site-packages/unstructured/utils.py", line 68, in is_temp_file_path
    return file_path.startswith(tempfile.gettempdir())
           ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PosixPath' object has no attribute 'startswith'
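
The last frame of the traceback can be reproduced with the standard library alone. The sketch below is not the code from `unstructured`: `is_temp_file_path_strict` only mirrors the single failing line shown above, and `is_temp_file_path_tolerant` is a hypothetical variant (an assumption for illustration, not the fix merged in #29091) that normalizes the argument with `os.fspath` first.

```python
import os
import tempfile
from pathlib import Path


def is_temp_file_path_strict(file_path):
    # Mirrors the failing line from the traceback: it assumes file_path is a str,
    # so a pathlib.Path argument raises AttributeError (Path has no .startswith).
    return file_path.startswith(tempfile.gettempdir())


def is_temp_file_path_tolerant(file_path):
    # Hypothetical variant: os.fspath() returns a str for both str and
    # os.PathLike inputs, so the startswith() check works for either type.
    return os.fspath(file_path).startswith(tempfile.gettempdir())


document = Path("./test.html")

error_message = None
try:
    is_temp_file_path_strict(document)
except AttributeError as exc:
    # Same class of error as in the traceback above.
    error_message = str(exc)
    print(error_message)

# The tolerant variant accepts the Path without raising.
print(is_temp_file_path_tolerant(document))
```

This suggests the failure comes from a `Path` reaching string-only code unchanged, rather than from the HTML file itself.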

Description

I'm trying to use LangChain to parse an HTML file, passing a Path object as the documentation allows.
The documentation says UnstructuredHTMLLoader accepts file_path: str | List[str] | Path | List[Path].
In practice, however, it only handles str input: passing a Path raises the AttributeError shown above.
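
Until a fixed version is installed, one possible workaround is to convert the `Path` to a `str` before constructing the loader. This is a minimal sketch, assuming the loader behaves correctly for `str` input as described above; `as_str_path` and `load_html` are hypothetical helper names, not part of LangChain.

```python
import os
from pathlib import Path
from typing import Union


def as_str_path(document: Union[str, Path]) -> str:
    # Hypothetical helper: normalize a str or Path to a plain str.
    return os.fspath(document)


def load_html(document: Union[str, Path]):
    # Imported lazily so as_str_path() can be used without langchain_community installed.
    from langchain_community.document_loaders import UnstructuredHTMLLoader

    # Same arguments as the example code in this issue, but with a str path.
    loader = UnstructuredHTMLLoader(
        as_str_path(document), mode="elements", strategy="fast"
    )
    return loader.load()


print(as_str_path(Path("./test.html")))
```

Calling `load_html(Path("./test.html"))` then hands the loader a plain string, so the `startswith` call in the traceback receives the type it expects.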

System Info

$ python -m langchain_core.sys_info

System Information
------------------
> OS:  Linux
> OS Version:  #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec  5 13:09:44 UTC 2024
> Python Version:  3.13.0 (main, Oct 16 2024, 03:23:02) [Clang 18.1.8 ]

Package Information
-------------------
> langchain_core: 0.3.29
> langchain: 0.3.14
> langchain_community: 0.3.14
> langsmith: 0.2.10
> langchain_text_splitters: 0.3.5

Optional packages not installed
-------------------------------
> langserve

Other Dependencies
------------------
> aiohttp: 3.11.11
> async-timeout: Installed. No version info available.
> dataclasses-json: 0.6.7
> httpx: 0.28.1
> httpx-sse: 0.4.0
> jsonpatch: 1.33
> langsmith-pyo3: Installed. No version info available.
> numpy: 1.26.4
> orjson: 3.10.13
> packaging: 24.2
> pydantic: 2.9.2
> pydantic-settings: 2.7.1
> PyYAML: 6.0.2
> requests: 2.32.3
> requests-toolbelt: 1.0.0
> SQLAlchemy: 2.0.36
> tenacity: 9.0.0
> typing-extensions: 4.12.2
> zstandard: Installed. No version info available.
@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Jan 8, 2025
@ccurme ccurme closed this as completed in 2b09f79 Jan 8, 2025