Could you provide your test file so that we can reproduce the problem?
Of course.
法律测试文件.pdf
Operating system information
Windows
What happened
Environment:
KAG 0.6, Python 3.9, Windows 10
Question:
![Image](https://private-user-images.githubusercontent.com/15890940/406631828-beca4856-4ccd-4fa9-a7ad-b3da1fad1dc1.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2OTE3MzgsIm5iZiI6MTczOTY5MTQzOCwicGF0aCI6Ii8xNTg5MDk0MC80MDY2MzE4MjgtYmVjYTQ4NTYtNGNjZC00ZmE5LWE3YWQtYjNkYTFmYWQxZGMxLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE2VDA3MzcxOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPThiMzMyZjg4OGY1MDk1ZDFiZTNmYjI4YzE1N2VjMjMwNjQ4NTFlYmFjYjhjZTVmNmJjOWIwNDhkNmFhMjhlN2UmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.JsJ23gzgIPB0bWkFAgJw0K-Knv0422QKXZU0Jl91zB8)
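When I use outline_splitter to split the text of a 10-page PDF by heading, it starts 10 threads that send asynchronous HTTP requests to the LLM API, but the 10 returned results contain many duplicated headings.
The first screenshot shows the more severe case, which occurs when running in PyCharm debug mode.
Without debug mode, a normal run still produces at least 3 duplicated results (second screenshot).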
![Image](https://private-user-images.githubusercontent.com/15890940/406632167-e7335527-74de-4da9-b707-8cde70dd029d.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk2OTE3MzgsIm5iZiI6MTczOTY5MTQzOCwicGF0aCI6Ii8xNTg5MDk0MC80MDY2MzIxNjctZTczMzU1MjctNzRkZS00ZGE5LWI3MDctOGNkZTcwZGQwMjlkLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE2VDA3MzcxOFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTk4MzNmYzc1ZjExZDBlODJkOTczNDhjY2UxM2IzZjBhZThkYjFjOWE5NjQxZWE1NDA4N2Q2ZDQ0NGU5ZTczMWEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.bT8Ub0kIrH7xtXvKzerHQkugoxP6fx09ic1BFByl9iM)
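To put a number on the duplication described above, a small helper can count how often each heading appears across the per-request results. This is only a sketch: it assumes each LLM response has already been parsed into a list of `(title, level)` tuples, which is a hypothetical shape and not necessarily what `OutlineSplitter` uses internally. The sample data is made up.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical shape: one list of (title, level) tuples per LLM request.
Outline = List[Tuple[str, int]]


def find_duplicate_titles(results: List[Outline]) -> Dict[str, int]:
    """Return every title that occurs more than once across all results."""
    counts = Counter(title for outline in results for title, _level in outline)
    return {title: n for title, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        [("Chapter 1", 1), ("Article 1", 2)],
        [("Chapter 1", 1), ("Article 2", 2)],
        [("Chapter 2", 1)],
    ]
    print(find_duplicate_titles(sample))
```

Comparing this count between a debug run and a normal run would make the difference between the two cases concrete.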
How to reproduce
```yaml
#------------project configuration start----------------#
openie_llm: &openie_llm
  api_key: token-abc123
  base_url: http://xxxx:8889/v1/chat/completions
  model: qwen2.5-32b-instruct-awq
  type: vllm

chat_llm: &chat_llm
  api_key: token-abc123
  base_url: http://xxxx:8889/v1/chat/completions
  model: qwen2.5-32b-instruct-awq
  type: vllm

vectorize_model: &vectorize_model
  api_key: EMPTY
  base_url: http://xxxx:6010/v1
  model: lier007xiaobu_embedding_v2
  type: openai
  vector_dimensions: 1536
vectorizer: *vectorize_model

log:
  level: DEBUG

project:
  biz_scene: default
  host_addr: http://xxxx:8887
  id: '44'
  language: zh
  namespace: Law
#------------project configuration end----------------#

#------------kag-builder configuration start----------------#
kag_builder_pipeline:
  chain:
    type: unstructured_builder_chain # kag.builder.default_chain.DefaultUnstructuredBuilderChain
    extractor:
      type: schema_constraint_extractor # kag.builder.component.extractor.schema_constraint_extractor.SchemaConstraintExtractor
      llm: *openie_llm
      ner_prompt:
        type: law_spg_entity # kag.builder.prompt.spg_prompt.SPGEntityPrompt
      event_prompt:
        type: law_spg_event # kag.builder.prompt.spg_prompt.SPGEventPrompt
      std_prompt:
        type: default_std # kag.builder.prompt.default.std.OpenIEEntitystandardizationdPrompt
      relation_prompt:
        type: law_spg_relation # kag.builder.prompt.spg_prompt.SPGRelationPrompt
    reader:
      type: pdf_reader # kag.builder.component.reader.pdf_reader.PDFReader
    post_processor:
      type: kag_post_processor # kag.builder.component.postprocessor.kag_postprocessor.KAGPostProcessor
      similarity_threshold: 0.9
    splitter:
      type: outline_splitter # kag.builder.component.splitter.outline_splitter.OutlineSplitter
      llm: *openie_llm
    vectorizer:
      type: batch_vectorizer # kag.builder.component.vectorizer.batch_vectorizer.BatchVectorizer
      vectorize_model: *vectorize_model
    writer:
      type: kg_writer # kag.builder.component.writer.kg_writer.KGWriter
  num_threads_per_chain: 1
  num_chains: 16
  scanner:
    type: file_scanner # kag.builder.component.scanner.file_scanner.FileScanner
#------------kag-builder configuration end----------------#

#------------kag-solver configuration start----------------#
search_api: &search_api
  type: openspg_search_api #kag.solver.tools.search_api.impl.openspg_search_api.OpenSPGSearchAPI

graph_api: &graph_api
  type: openspg_graph_api #kag.solver.tools.graph_api.impl.openspg_graph_api.OpenSPGGraphApi

exact_kg_retriever: &exact_kg_retriever
  type: default_exact_kg_retriever # kag.solver.retriever.impl.default_exact_kg_retriever.DefaultExactKgRetriever
  el_num: 5
  llm_client: *chat_llm
  search_api: *search_api
  graph_api: *graph_api

fuzzy_kg_retriever: &fuzzy_kg_retriever
  type: default_fuzzy_kg_retriever # kag.solver.retriever.impl.default_fuzzy_kg_retriever.DefaultFuzzyKgRetriever
  el_num: 5
  vectorize_model: *vectorize_model
  llm_client: *chat_llm
  search_api: *search_api
  graph_api: *graph_api

chunk_retriever: &chunk_retriever
  type: default_chunk_retriever # kag.solver.retriever.impl.default_fuzzy_kg_retriever.DefaultFuzzyKgRetriever
  llm_client: *chat_llm
  recall_num: 10
  rerank_topk: 10

kag_solver_pipeline:
  memory:
    type: default_memory # kag.solver.implementation.default_memory.DefaultMemory
    llm_client: *chat_llm
  max_iterations: 3
  reasoner:
    type: default_reasoner # kag.solver.implementation.default_reasoner.DefaultReasoner
    llm_client: *chat_llm
    lf_planner:
      type: default_lf_planner # kag.solver.plan.default_lf_planner.DefaultLFPlanner
      llm_client: *chat_llm
      vectorize_model: *vectorize_model
    lf_executor:
      type: default_lf_executor # kag.solver.execute.default_lf_executor.DefaultLFExecutor
      llm_client: *chat_llm
      force_chunk_retriever: true
      exact_kg_retriever: *exact_kg_retriever
      fuzzy_kg_retriever: *fuzzy_kg_retriever
      chunk_retriever: *chunk_retriever
      merger:
        type: default_lf_sub_query_res_merger # kag.solver.execute.default_sub_query_merger.DefaultLFSubQueryResMerger
        vectorize_model: *vectorize_model
        chunk_retriever: *chunk_retriever
  generator:
    type: default_generator # kag.solver.implementation.default_generator.DefaultGenerator
    llm_client: *chat_llm
    generate_prompt:
      type: resp_simple # kag/examples/2wiki/solver/prompt/resp_generator.py
  reflector:
    type: default_reflector # kag.solver.implementation.default_reflector.DefaultReflector
    llm_client: *chat_llm
#------------kag-solver configuration end----------------#
```
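2. Add the PDF path to the main block of outline_splitter.py:
```python
# LLMClient, KAG_CONFIG, OutlineSplitter and os come from the enclosing outline_splitter.py module.
if __name__ == "__main__":
    from kag.builder.component.splitter.length_splitter import LengthSplitter
    from kag.builder.component.reader.docx_reader import DocxReader
    from kag.builder.component.reader.txt_reader import TXTReader
    from kag.builder.component.reader.pdf_reader import PDFReader

    pdf_reader = PDFReader()
    docx_reader = DocxReader()
    txt_reader = TXTReader()
    length_splitter = LengthSplitter(split_length=5000)

    llm = LLMClient.from_config(KAG_CONFIG.all_config["openie_llm"])
    outline_splitter = OutlineSplitter(llm=llm)
    pdf_path = os.path.join(
        os.path.dirname(__file__), "../../../examples/law/builder/data/xxxx.pdf"
    )
    chunk = pdf_reader.invoke(pdf_path)
    chunks = outline_splitter.invoke(chunk)
    print(chunks)
```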
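One way to check whether the duplicates come from the concurrent requests rather than from the model itself is to run the same per-chunk call once sequentially and once in a thread pool, then compare the two result lists. The sketch below is an assumption-laden illustration: `extract` is a hypothetical callable standing in for the per-chunk LLM request and is not part of KAG, and the comparison is only meaningful if the model is configured to answer deterministically.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_sequential(extract: Callable[[T], R], chunks: Sequence[T]) -> List[R]:
    """Call `extract` on each chunk, one at a time."""
    return [extract(chunk) for chunk in chunks]


def run_threaded(
    extract: Callable[[T], R], chunks: Sequence[T], max_workers: int = 10
) -> List[R]:
    """Call `extract` on each chunk from a thread pool."""
    # Executor.map yields results in input order, so both lists line up.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(extract, chunks))


def mismatched_indices(a: Sequence[R], b: Sequence[R]) -> List[int]:
    """Indices at which the two result lists differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    # str.upper is a stand-in for the real (hypothetical) LLM call.
    pages = ["page one", "page two", "page three"]
    diff = mismatched_indices(
        run_sequential(str.upper, pages), run_threaded(str.upper, pages)
    )
    print(diff)
```

If the sequential run is clean and only the threaded run shows repeated headings, that would point toward state shared between threads; if both show them, the prompt or the model output is the more likely source.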