[Question]: Building multiple indexes #16584
Comments
When creating and storing multiple types of indexes from nodes in the LlamaIndex framework, there are some potential issues and limitations you should keep in mind:
These issues underscore the importance of careful management and configuration when working with multiple indexes in the LlamaIndex framework.
@dosu
openai.api_key = st.secrets.openai.OPENAI_API_KEY
def ingest_documents():
For your use case with PDF documents of the ICD11 book, consider the following:
If the generated summaries are not meeting your expectations, you might need to refine your prompt template or add preprocessing steps that better structure the input data for the extractor.
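As an illustration (not code from this thread), a refined template for the SummaryExtractor might look like the sketch below. The wording is an assumption to adapt for your corpus; the one structural requirement is the {context_str} placeholder, which the extractor fills with each node's text.

```python
# A custom summary prompt for SummaryExtractor. The exact wording is an
# assumption; the extractor substitutes each node's text for {context_str}.
CUSTOM_SUMMARY_EXTRACT_TEMPLATE = """\
Here is the content of a section from the ICD11 book:
{context_str}

Summarize the key diagnostic concepts, codes, and definitions above.
If the section contains only a heading, return the heading unchanged.
Summary:"""

# The template must format cleanly with a single context_str argument.
rendered = CUSTOM_SUMMARY_EXTRACT_TEMPLATE.format(context_str="Example section text")
print(rendered.splitlines()[0])
```

You would then pass this string as `prompt_template=CUSTOM_SUMMARY_EXTRACT_TEMPLATE` when constructing the extractor.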
@dosu
For ICD11 data, the choice between
Choice:
For ICD11 data, which typically has a complex structure and requires semantic understanding,
@dosu Here is the content of my documents:
The issue you're encountering, where a PDF page with only a title results in a document and subsequently a node containing just that title, is likely due to how the document loading and node creation processes handle minimal content. In the LlamaIndex framework, when a document is created from a PDF page, it includes whatever text is present, even if it's just a title. This behavior can lead to nodes that contain only minimal content, such as a title, which might not be useful for your application. To address this, you can implement a preprocessing step to filter out or handle documents with minimal content before they are converted into nodes. Here are a few strategies you might consider:
By incorporating these strategies, you can ensure that only documents with meaningful content are processed into nodes, improving the quality and relevance of your data.
Please provide illustrative code.
To address the issue of creating nodes from PDF pages that contain only titles, you can implement a preprocessing step that filters out documents with minimal content. Here's a code example illustrating how you might achieve this:
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.extractors import SummaryExtractor
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI
import openai
import streamlit as st
from src.global_settings import STORAGE_PATH, FILES_PATH, CACHE_FILE
from src.prompts import CUSTORM_SUMMARY_EXTRACT_TEMPLATE
openai.api_key = st.secrets.openai.OPENAI_API_KEY
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
def ingest_documents():
    # Load documents from the configured file paths
    documents = SimpleDirectoryReader(
        input_files=FILES_PATH,
        filename_as_id=True
    ).load_data()

    # Filter out documents with minimal content
    filtered_documents = [doc for doc in documents if len(doc.text.strip()) > 10]  # Adjust threshold as needed
    for doc in filtered_documents:
        print(doc.id_)

    try:
        cached_hashes = IngestionCache.from_persist_path(CACHE_FILE)
        print("Cache file found. Running using cache...")
    except Exception:
        cached_hashes = None
        print("No cache file found. Running without cache...")

    pipeline = IngestionPipeline(
        transformations=[
            TokenTextSplitter(chunk_size=1024, chunk_overlap=20),
            SummaryExtractor(summaries=['self'], prompt_template=CUSTORM_SUMMARY_EXTRACT_TEMPLATE),
            OpenAIEmbedding()
        ],
        cache=cached_hashes
    )
    nodes = pipeline.run(documents=filtered_documents)
    pipeline.cache.persist(CACHE_FILE)
    return nodes
# Call the function to ingest documents
nodes = ingest_documents()

In this example, the list comprehension filters out documents whose stripped text is 10 characters or fewer, so title-only pages never reach the pipeline; adjust the threshold to match your data.
Question Validation
Question
I create and store multiple types of indexes from nodes; does this cause any problems?
def build_indexes(nodes, persist_dir):
    try:
        storage_context = StorageContext.from_defaults(
            persist_dir=persist_dir
        )