[Bug] [KAG] outline_splitter.py: multithreaded asynchronous HTTP requests to the LLM API are buggy #327

SwordfallYeung · 2025-01-25T02:03:58Z

Search before asking

I have searched the issues and found no similar issues.

Operating system information

Windows

What happened

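Environment:
KAG 0.6, Python 3.9, Windows 10

Question:
When using outline_splitter to split the text of a 10-page PDF by headings, I found that it starts 10 threads that send asynchronous HTTP requests to the LLM API, but the 10 returned results contain quite a few duplicate headings, as shown below.
This is the more severe case, which occurs with PyCharm debug mode enabled:

Without debug mode, a normal run still produces at least 3 duplicate results.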
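To put a number on the duplication when comparing the debug and non-debug runs, a small helper like the following could be used. It is only a sketch: it assumes each of the 10 responses has already been reduced to a list of heading strings, and `find_duplicate_titles` and the sample data are hypothetical, not part of KAG.

```python
from collections import Counter


def find_duplicate_titles(results):
    """Return {title: count} for every title seen more than once across all results."""
    counts = Counter(title for titles in results for title in titles)
    return {title: n for title, n in counts.items() if n > 1}


# Hypothetical sample: one list of headings per LLM response.
results = [
    ["Chapter 1 General Provisions", "Article 1"],
    ["Chapter 1 General Provisions", "Article 2"],
    ["Chapter 2 Contracts"],
]
duplicates = find_duplicate_titles(results)
print(duplicates)
```

Comparing the size of this dictionary between runs gives a rough measure of how much worse the debug-mode case is.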

How to reproduce

1. Edit kag_config.yaml under kag/examples so that it contains the following:
```yaml
#------------project configuration start----------------#
openie_llm: &openie_llm
  api_key: token-abc123
  base_url: http://xxxx:8889/v1/chat/completions
  model: qwen2.5-32b-instruct-awq
  type: vllm

chat_llm: &chat_llm
  api_key: token-abc123
  base_url: http://xxxx:8889/v1/chat/completions
  model: qwen2.5-32b-instruct-awq
  type: vllm

vectorize_model: &vectorize_model
  api_key: EMPTY
  base_url: http://xxxx:6010/v1
  model: lier007xiaobu_embedding_v2
  type: openai
  vector_dimensions: 1536
vectorizer: *vectorize_model

log:
  level: DEBUG

project:
  biz_scene: default
  host_addr: http://xxxx:8887
  id: '44'
  language: zh
  namespace: Law
#------------project configuration end----------------#

#------------kag-builder configuration start----------------#
kag_builder_pipeline:
  chain:
    type: unstructured_builder_chain # kag.builder.default_chain.DefaultUnstructuredBuilderChain
    extractor:
      type: schema_constraint_extractor # kag.builder.component.extractor.schema_constraint_extractor.SchemaConstraintExtractor
      llm: *openie_llm
      ner_prompt:
        type: law_spg_entity # kag.builder.prompt.spg_prompt.SPGEntityPrompt
      event_prompt:
        type: law_spg_event # kag.builder.prompt.spg_prompt.SPGEventPrompt
      std_prompt:
        type: default_std # kag.builder.prompt.default.std.OpenIEEntitystandardizationdPrompt
      relation_prompt:
        type: law_spg_relation # kag.builder.prompt.spg_prompt.SPGRelationPrompt
    reader:
      type: pdf_reader # kag.builder.component.reader.pdf_reader.PDFReader
    post_processor:
      type: kag_post_processor # kag.builder.component.postprocessor.kag_postprocessor.KAGPostProcessor
      similarity_threshold: 0.9
    splitter:
      type: outline_splitter # kag.builder.component.splitter.outline_splitter.OutlineSplitter
      llm: *openie_llm
    vectorizer:
      type: batch_vectorizer # kag.builder.component.vectorizer.batch_vectorizer.BatchVectorizer
      vectorize_model: *vectorize_model
    writer:
      type: kg_writer # kag.builder.component.writer.kg_writer.KGWriter
  num_threads_per_chain: 1
  num_chains: 16
  scanner:
    type: file_scanner # kag.builder.component.scanner.file_scanner.FileScanner
#------------kag-builder configuration end----------------#

#------------kag-solver configuration start----------------#
search_api: &search_api
  type: openspg_search_api #kag.solver.tools.search_api.impl.openspg_search_api.OpenSPGSearchAPI

graph_api: &graph_api
  type: openspg_graph_api #kag.solver.tools.graph_api.impl.openspg_graph_api.OpenSPGGraphApi

exact_kg_retriever: &exact_kg_retriever
  type: default_exact_kg_retriever # kag.solver.retriever.impl.default_exact_kg_retriever.DefaultExactKgRetriever
  el_num: 5
  llm_client: *chat_llm
  search_api: *search_api
  graph_api: *graph_api

fuzzy_kg_retriever: &fuzzy_kg_retriever
  type: default_fuzzy_kg_retriever # kag.solver.retriever.impl.default_fuzzy_kg_retriever.DefaultFuzzyKgRetriever
  el_num: 5
  vectorize_model: *vectorize_model
  llm_client: *chat_llm
  search_api: *search_api
  graph_api: *graph_api

chunk_retriever: &chunk_retriever
  type: default_chunk_retriever # kag.solver.retriever.impl.default_fuzzy_kg_retriever.DefaultFuzzyKgRetriever
  llm_client: *chat_llm
  recall_num: 10
  rerank_topk: 10

kag_solver_pipeline:
  memory:
    type: default_memory # kag.solver.implementation.default_memory.DefaultMemory
    llm_client: *chat_llm
  max_iterations: 3
  reasoner:
    type: default_reasoner # kag.solver.implementation.default_reasoner.DefaultReasoner
    llm_client: *chat_llm
    lf_planner:
      type: default_lf_planner # kag.solver.plan.default_lf_planner.DefaultLFPlanner
      llm_client: *chat_llm
      vectorize_model: *vectorize_model
    lf_executor:
      type: default_lf_executor # kag.solver.execute.default_lf_executor.DefaultLFExecutor
      llm_client: *chat_llm
      force_chunk_retriever: true
      exact_kg_retriever: *exact_kg_retriever
      fuzzy_kg_retriever: *fuzzy_kg_retriever
      chunk_retriever: *chunk_retriever
      merger:
        type: default_lf_sub_query_res_merger # kag.solver.execute.default_sub_query_merger.DefaultLFSubQueryResMerger
        vectorize_model: *vectorize_model
        chunk_retriever: *chunk_retriever
  generator:
    type: default_generator # kag.solver.implementation.default_generator.DefaultGenerator
    llm_client: *chat_llm
    generate_prompt:
      type: resp_simple # kag/examples/2wiki/solver/prompt/resp_generator.py
  reflector:
    type: default_reflector # kag.solver.implementation.default_reflector.DefaultReflector
    llm_client: *chat_llm

#------------kag-solver configuration end----------------#
```
2. In the main block of outline_splitter.py, add the PDF path:

```python
# This block sits at the bottom of outline_splitter.py, so os, LLMClient,
# KAG_CONFIG and OutlineSplitter are assumed to be available at module level.
if __name__ == "__main__":
    from kag.builder.component.splitter.length_splitter import LengthSplitter
    from kag.builder.component.reader.docx_reader import DocxReader
    from kag.builder.component.reader.txt_reader import TXTReader
    from kag.builder.component.reader.pdf_reader import PDFReader

    pdf_reader = PDFReader()
    docx_reader = DocxReader()
    txt_reader = TXTReader()
    length_splitter = LengthSplitter(split_length=5000)

    llm = LLMClient.from_config(KAG_CONFIG.all_config["openie_llm"])
    outline_splitter = OutlineSplitter(llm=llm)
    txt_path = os.path.join(
        os.path.dirname(__file__), "../../../../tests/builder/data/儿科学_short.txt"
    )
    docx_path = "/Users/zhangxinhong.zxh/Downloads/waikexue_short.docx"
    # test_dir = "/Users/zhangxinhong.zxh/Downloads/1127_medkag_book"
    # pdf_path = "/Users/zhangxinhong.zxh/Downloads/toaz.info-5dsm-5-pr_56e68a629dc4fe62699960dd5afbe362.pdf"
    # The PDF under test (the only path changed for this reproduction).
    pdf_path = os.path.join(
        os.path.dirname(__file__), "../../../examples/law/builder/data/xxxx.pdf"
    )
    # files = [
    #     os.path.join(test_dir, file)
    #     for file in os.listdir(test_dir)
    #     if file.endswith(".docx")
    # ]
    # files = [
    #     files[0],
    # ]

    # with ThreadPoolExecutor(max_workers=10) as executor:
    #     futures = [executor.submit(process_file, file) for file in files]

    # for future in as_completed(futures):
    #     print(future.result())

    # process_file_without_chain(docx_path)
    a = 1
    # chunk = docx_reader.invoke(docx_path)
    # chunk = txt_reader.invoke(txt_path)
    chunk = pdf_reader.invoke(pdf_path)
    # chunks = length_splitter.invoke(chunk)
    chunks = outline_splitter.invoke(chunk)
    print(chunks)
```

The returned headings contain duplicate data.
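One way to narrow down whether the duplicates come from the client-side thread pool or from the model server is to run the same fan-out against a stub that simply echoes its input. The sketch below is not the outline_splitter implementation: `fake_llm_call` and `run_in_threads` are hypothetical names, and the 10-worker pool only mirrors the thread count reported above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for the real LLM request; it echoes its input."""
    time.sleep(0.01)  # simulate network latency so the calls overlap
    return f"outline for: {prompt}"


def run_in_threads(prompts, call, max_workers=10):
    """Run call(prompt) on a thread pool and return (prompt, response) pairs in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        responses = list(executor.map(call, prompts))
    return list(zip(prompts, responses))


prompts = [f"page {i}" for i in range(10)]
pairs = run_in_threads(prompts, fake_llm_call)
mismatched = [(p, r) for p, r in pairs if r != f"outline for: {p}"]
print(len(pairs), "responses,", len(mismatched), "mismatched")
```

If the stub returns one distinct response per page but the real client does not, the duplication is more likely to originate in how requests are built or sent than in the pool itself. That is a hypothesis to test, not a confirmed diagnosis.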

Are you willing to submit PR?

Yes I am willing to submit a PR!


caszkgui · 2025-02-11T01:37:46Z

pdf_path

Could you provide your test file so that we can reproduce the issue?

SwordfallYeung · 2025-02-12T03:22:43Z

Of course.

法律测试文件.pdf

caszkgui assigned northmachine Feb 11, 2025
