[Bug] [KAG] outline_splitter.py: bug in multithreaded asynchronous HTTP requests to the LLM API #327

Open
1 of 2 tasks
SwordfallYeung opened this issue Jan 25, 2025 · 2 comments

SwordfallYeung commented Jan 25, 2025

Search before asking

  • I had searched in the issues and found no similar issues.

Operating system information

Windows

What happened

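Environment:
kag 0.6, Python 3.9, Windows 10

Question:
When I use outline_splitter to split the text of a 10-page PDF by heading, it starts 10 threads that send asynchronous HTTP requests to the LLM API, but the 10 returned results contain quite a few duplicate titles. Details below.
This is the more severe case, which shows up when running in PyCharm debug mode:
Image

Without debug mode, a normal run still produces at least 3 duplicate results:
Image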
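To put a number on the duplication instead of reading it off screenshots, the titles returned by each thread can be counted. This is a minimal sketch, assuming each thread's result can be reduced to a plain list of title strings; `find_duplicate_titles` and the sample data are hypothetical and are not part of KAG.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_duplicate_titles(results: Iterable[List[str]]) -> List[Tuple[str, int]]:
    """Return (title, count) pairs for titles that occur more than once.

    `results` holds one list of titles per thread. Titles are stripped of
    surrounding whitespace before counting.
    """
    counts = Counter(
        title.strip()
        for titles in results
        for title in titles
    )
    return sorted((title, n) for title, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Hypothetical per-thread outputs, not taken from the screenshots.
    per_thread = [["Chapter 1", "Section 1.1"], ["Chapter 1"], ["Chapter 2"]]
    print(find_duplicate_titles(per_thread))
```

Running the same check with and without the debugger would show whether the duplicate count really changes between the two modes.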

How to reproduce

  1. Edit kag_config.yaml under kag/examples so that it contains the following:
```yaml
#------------project configuration start----------------#
openie_llm: &openie_llm
  api_key: token-abc123
  base_url: http://xxxx:8889/v1/chat/completions
  model: qwen2.5-32b-instruct-awq
  type: vllm

chat_llm: &chat_llm
  api_key: token-abc123
  base_url: http://xxxx:8889/v1/chat/completions
  model: qwen2.5-32b-instruct-awq
  type: vllm

vectorize_model: &vectorize_model
  api_key: EMPTY
  base_url: http://xxxx:6010/v1
  model: lier007xiaobu_embedding_v2
  type: openai
  vector_dimensions: 1536
vectorizer: *vectorize_model

log:
  level: DEBUG

project:
  biz_scene: default
  host_addr: http://xxxx:8887
  id: '44'
  language: zh
  namespace: Law
#------------project configuration end----------------#

#------------kag-builder configuration start----------------#
kag_builder_pipeline:
  chain:
    type: unstructured_builder_chain # kag.builder.default_chain.DefaultUnstructuredBuilderChain
    extractor:
      type: schema_constraint_extractor # kag.builder.component.extractor.schema_constraint_extractor.SchemaConstraintExtractor
      llm: *openie_llm
      ner_prompt:
        type: law_spg_entity # kag.builder.prompt.spg_prompt.SPGEntityPrompt
      event_prompt:
        type: law_spg_event # kag.builder.prompt.spg_prompt.SPGEventPrompt
      std_prompt:
        type: default_std # kag.builder.prompt.default.std.OpenIEEntitystandardizationdPrompt
      relation_prompt:
        type: law_spg_relation # kag.builder.prompt.spg_prompt.SPGRelationPrompt
    reader:
      type: pdf_reader # kag.builder.component.reader.pdf_reader.PDFReader
    post_processor:
      type: kag_post_processor # kag.builder.component.postprocessor.kag_postprocessor.KAGPostProcessor
      similarity_threshold: 0.9
    splitter:
      type: outline_splitter # kag.builder.component.splitter.outline_splitter.OutlineSplitter
      llm: *openie_llm
    vectorizer:
      type: batch_vectorizer # kag.builder.component.vectorizer.batch_vectorizer.BatchVectorizer
      vectorize_model: *vectorize_model
    writer:
      type: kg_writer # kag.builder.component.writer.kg_writer.KGWriter
  num_threads_per_chain: 1
  num_chains: 16
  scanner:
    type: file_scanner # kag.builder.component.scanner.file_scanner.FileScanner
#------------kag-builder configuration end----------------#

#------------kag-solver configuration start----------------#
search_api: &search_api
  type: openspg_search_api #kag.solver.tools.search_api.impl.openspg_search_api.OpenSPGSearchAPI

graph_api: &graph_api
  type: openspg_graph_api #kag.solver.tools.graph_api.impl.openspg_graph_api.OpenSPGGraphApi

exact_kg_retriever: &exact_kg_retriever
  type: default_exact_kg_retriever # kag.solver.retriever.impl.default_exact_kg_retriever.DefaultExactKgRetriever
  el_num: 5
  llm_client: *chat_llm
  search_api: *search_api
  graph_api: *graph_api

fuzzy_kg_retriever: &fuzzy_kg_retriever
  type: default_fuzzy_kg_retriever # kag.solver.retriever.impl.default_fuzzy_kg_retriever.DefaultFuzzyKgRetriever
  el_num: 5
  vectorize_model: *vectorize_model
  llm_client: *chat_llm
  search_api: *search_api
  graph_api: *graph_api

chunk_retriever: &chunk_retriever
  type: default_chunk_retriever # kag.solver.retriever.impl.default_fuzzy_kg_retriever.DefaultFuzzyKgRetriever
  llm_client: *chat_llm
  recall_num: 10
  rerank_topk: 10

kag_solver_pipeline:
  memory:
    type: default_memory # kag.solver.implementation.default_memory.DefaultMemory
    llm_client: *chat_llm
  max_iterations: 3
  reasoner:
    type: default_reasoner # kag.solver.implementation.default_reasoner.DefaultReasoner
    llm_client: *chat_llm
    lf_planner:
      type: default_lf_planner # kag.solver.plan.default_lf_planner.DefaultLFPlanner
      llm_client: *chat_llm
      vectorize_model: *vectorize_model
    lf_executor:
      type: default_lf_executor # kag.solver.execute.default_lf_executor.DefaultLFExecutor
      llm_client: *chat_llm
      force_chunk_retriever: true
      exact_kg_retriever: *exact_kg_retriever
      fuzzy_kg_retriever: *fuzzy_kg_retriever
      chunk_retriever: *chunk_retriever
      merger:
        type: default_lf_sub_query_res_merger # kag.solver.execute.default_sub_query_merger.DefaultLFSubQueryResMerger
        vectorize_model: *vectorize_model
        chunk_retriever: *chunk_retriever
  generator:
    type: default_generator # kag.solver.implementation.default_generator.DefaultGenerator
    llm_client: *chat_llm
    generate_prompt:
      type: resp_simple # kag/examples/2wiki/solver/prompt/resp_generator.py
  reflector:
    type: default_reflector # kag.solver.implementation.default_reflector.DefaultReflector
    llm_client: *chat_llm

#------------kag-solver configuration end----------------#
```
  2. Add the PDF path in the main block of outline_splitter.py:

```python
# Runs inside outline_splitter.py, so os, LLMClient, KAG_CONFIG and
# OutlineSplitter are already available at module level.
if __name__ == "__main__":
    from kag.builder.component.splitter.length_splitter import LengthSplitter
    from kag.builder.component.reader.docx_reader import DocxReader
    from kag.builder.component.reader.txt_reader import TXTReader
    from kag.builder.component.reader.pdf_reader import PDFReader

    pdf_reader = PDFReader()
    docx_reader = DocxReader()
    txt_reader = TXTReader()
    length_splitter = LengthSplitter(split_length=5000)

    # Build the LLM client from the openie_llm section of kag_config.yaml.
    llm = LLMClient.from_config(KAG_CONFIG.all_config["openie_llm"])
    outline_splitter = OutlineSplitter(llm=llm)
    txt_path = os.path.join(
        os.path.dirname(__file__), "../../../../tests/builder/data/儿科学_short.txt"
    )
    docx_path = "/Users/zhangxinhong.zxh/Downloads/waikexue_short.docx"
    # test_dir = "/Users/zhangxinhong.zxh/Downloads/1127_medkag_book"
    # pdf_path = "/Users/zhangxinhong.zxh/Downloads/toaz.info-5dsm-5-pr_56e68a629dc4fe62699960dd5afbe362.pdf"
    # The only change for this reproduction: point pdf_path at the test PDF.
    pdf_path = os.path.join(
        os.path.dirname(__file__), "../../../examples/law/builder/data/xxxx.pdf"
    )
    # files = [
    #     os.path.join(test_dir, file)
    #     for file in os.listdir(test_dir)
    #     if file.endswith(".docx")
    # ]
    # files = [
    #     files[0],
    # ]

    # with ThreadPoolExecutor(max_workers=10) as executor:
    #     futures = [executor.submit(process_file, file) for file in files]

    # for future in as_completed(futures):
    #     print(future.result())

    # process_file_without_chain(docx_path)
    a = 1
    # chunk = docx_reader.invoke(docx_path)
    # chunk = txt_reader.invoke(txt_path)
    chunk = pdf_reader.invoke(pdf_path)
    # chunks = length_splitter.invoke(chunk)
    # Read the PDF, then split it by outline; this is the call that returns duplicate titles.
    chunks = outline_splitter.invoke(chunk)
    print(chunks)
```
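One way to narrow this down is to key every future by the index of the input it was given, then compare input and output per index. If distinct inputs come back with identical outputs, the duplication arises in the request path or in the model's response; if the inputs themselves overlap, it arises in the splitting. The sketch below is not KAG's implementation: `run_indexed` and `fake_llm` are hypothetical names, and `fake_llm` only stands in for the real HTTP call.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def run_indexed(
    call: Callable[[str], str], inputs: List[str], max_workers: int = 10
) -> Dict[int, str]:
    """Run `call` on every input in a thread pool; return results keyed by input index."""
    results: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the position of the input it received.
        future_to_index = {
            executor.submit(call, text): index for index, text in enumerate(inputs)
        }
        for future in as_completed(future_to_index):
            results[future_to_index[future]] = future.result()
    return results


def fake_llm(text: str) -> str:
    # Stand-in for the real LLM request (assumption): echoes its input.
    return f"outline of {text}"


if __name__ == "__main__":
    pages = [f"page {i}" for i in range(10)]
    outputs = run_indexed(fake_llm, pages)
    for index in sorted(outputs):
        print(index, pages[index], "->", outputs[index])
```

Swapping `fake_llm` for a function that wraps the real request would show, per index, which inputs produced the repeated titles.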

  3. The returned titles contain duplicates:
    Image

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
@caszkgui
Collaborator

pdf_path

Could you provide your test file so that we can reproduce the issue?

@SwordfallYeung
Author

Of course.

法律测试文件.pdf
