DirectoryLoader(silent_errors=True) logs warnings about files that fail to load. Can we get those files in a list after loading a directory? #16863
ragvendra3898
started this conversation in
Ideas
Replies: 0 comments
Feature request
Hi, I am using the code below to index my documents into ChromaDB:
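import logging

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.vectorstores.utils import filter_complex_metadata

# connect_vectordb, delib, embeddings and db_directory are defined elsewhere in my project.
vectordb = connect_vectordb()

# silent_errors=True logs a warning for each file that fails to load instead of raising.
loader = DirectoryLoader('/home/uploaded', silent_errors=True, use_multithreading=True)
docs = loader.load()
documents = filter_complex_metadata(docs)

text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    separators=["\n\n", "\n", "\t"],
    chunk_size=delib.CHUNK_SIZE_TOKENS,
    chunk_overlap=200,
)
texts = text_splitter.split_documents(documents)
num_documents = len(texts)
logging.info(f"Split file into {num_documents} documents")
vectordb.add_documents(documents=texts, embedding=embeddings, persist_directory=db_directory)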
It logs warnings about the files that have issues. A sample is given below:
2024-01-31T14:00:41.2197635Z 2024-01-31 14:00:41 WARNING: Error loading file /home/uploaded/roi.doc: soffice command was not found. Please install libreoffice
2024-01-31T14:04:15.2401154Z 2024-01-31 14:04:15 WARNING: The PDF <_io.BufferedReader name='/home/uploaded/3.0 upgrade considerations 01.13.21 secured.pdf'> contains a metadata field indicating that it should not allow text extraction. Ignoring this field and proceeding. Use the check_extractable if you want to raise an error in this case
2024-01-31T14:04:21.4557395Z 2024-01-31 14:04:21 WARNING: The MIME type of '/home/uploaded/302_vat_registration_and_more_enhancements (1080p) (1).mp4' is 'video/mp4'. This file type is not currently supported in unstructured.
2024-01-31T14:04:21.4579857Z 2024-01-31 14:04:21 WARNING: Error loading file /home/uploaded/302_vat_registration_and_more_enhancements (1080p) (1).mp4: Invalid file /home/uploaded/302_vat_registration_and_more_enhancements (1080p) (1).mp4. The FileType.UNK file type is not supported in partition.
However, I would like to get all the files that had issues in a list. Is that possible? If not, could this feature be added, please?
Thanks
Motivation
I want to give users the list of files that could not be indexed, so that they know not to query those documents.
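To illustrate the kind of list I mean, here is a minimal sketch of a workaround that attaches a logging handler while the loader runs and picks the file paths out of the "Error loading file ..." warnings. It is only a sketch under assumptions: `FailedFileCollector` and `collect_failures` are hypothetical names of my own, the message format is copied from the warnings quoted above and may differ between versions, and I am assuming the warnings are emitted through a logger under the `langchain_community` namespace.

```python
import logging
import re

# Pattern copied from the warnings shown above (assumption: the format may change).
_ERROR_RE = re.compile(r"Error loading file (?P<path>.+?): (?P<reason>.*)", re.DOTALL)


class FailedFileCollector(logging.Handler):
    """Hypothetical helper: remembers files named in 'Error loading file' warnings."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.failed = {}  # maps file path -> reason text from the warning

    def emit(self, record):
        match = _ERROR_RE.match(record.getMessage())
        if match:
            self.failed[match["path"]] = match["reason"]


def collect_failures(load, logger_name="langchain_community"):
    """Call load() and return (result, failed_files).

    logger_name is an assumption; pass "" to listen on the root logger instead.
    """
    collector = FailedFileCollector()
    logger = logging.getLogger(logger_name)
    logger.addHandler(collector)
    try:
        result = load()
    finally:
        # Always detach the handler, even if load() raises.
        logger.removeHandler(collector)
    return result, collector.failed
```

With the loader from my code this would be used as `docs, failed = collect_failures(loader.load)`, after which `list(failed)` is the list of files that could not be loaded. It only catches files reported with the "Error loading file" prefix, so warnings such as the PDF metadata one, where the file is still loaded, are ignored. Parsing log text is fragile, which is why a built-in way to get this list would be better.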
Proposal (If applicable)
No response