
English | Simplified Chinese

ROGRAG: A Robustly Optimized GraphRAG Approach

πŸ”₯ Introduction

GraphRAG has many tunable components, which makes it hard to tell whether performance gains come from parameter adjustments or from pipeline optimizations. Moreover, RAG test data is often already embedded in LLM training sets, and input tokens influence generation probabilities (background: the phi-4 technical report), so it is unclear whether precision improvements come from genuine retrieval or merely from surfacing key tokens the model has already seen.

Thus, HuixiangDou2 integrates multiple open-source projects (HuixiangDou, KAG, LightRAG, and DB-GPT, about 18k lines of code in total) and runs comparative experiments on a test set where Qwen2.5-7B-Instruct underperforms. The score rose from 60 to 74.5. The result is a GraphRAG implementation whose performance has been recognized by human domain experts. Here is the report.

Note: the impact of open source varies across fields and industries. Due to licensing restrictions, we can only release the code and the test conclusions; the test data cannot be provided.

πŸ“– Documentation

If you find this project useful, please star it ⭐

πŸ”† Version Description

Compared to HuixiangDou v1, this repo improves accuracy:

  1. Graph schema. Dense retrieval is used only to look up similar entities and relationships (see the sketch at the end of this section).

  2. Ported and merged multiple open-source implementations, with code differences of nearly 18k lines:

    • Data. Curated a set of real domain knowledge that LLMs have not fully seen, for testing (GPT accuracy < 0.6)
    • Ablation. Confirmed the impact of different stages and parameters on accuracy
    • Improvement. As shown below.
  3. The API remains compatible, which means the WeChat/Lark/Web frontends from v1 are still accessible:

    # v1 API: https://github.com/InternLM/HuixiangDou/blob/main/huixiangdou/service/parallel_pipeline.py#L290
    async def generate(self,
                       query: Union[Query, str],
                       history: List[Tuple[str]] = [],
                       language: str = 'zh',
                       enable_web_search: bool = True,
                       enable_code_search: bool = True):

    # v2 API: https://github.com/tpoisonooo/HuixiangDou2/blob/main/huixiangdou/pipeline/parallel.py#L135
    async def generate(self,
                       query: Union[Query, str],
                       history: List[Pair] = [],
                       request_id: str = 'default',
                       language: str = 'zh_cn'):
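
For illustration, here is a minimal sketch of driving the v2 API. Only the generate signature and the module path come from the links above; the ParallelPipeline class name, the constructor arguments, and the async-generator behavior (carried over from v1) are assumptions and may differ:

    # Hypothetical usage sketch. Only generate()'s signature is taken from the repo;
    # the class name and constructor arguments are assumptions.
    import asyncio

    from huixiangdou.pipeline.parallel import ParallelPipeline  # module path from the v2 link above

    async def main():
        pipeline = ParallelPipeline(work_dir='workdir')  # constructor args are assumptions
        # v1's generate is an async generator; assuming v2 keeps that behavior.
        async for response in pipeline.generate(query='How do I deploy a model?',
                                                history=[],
                                                request_id='demo-001',
                                                language='zh_cn'):
            print(response)

    if __name__ == '__main__':
        asyncio.run(main())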
    

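As referenced in item 1 above, here is a minimal, self-contained sketch of what "dense retrieval only for similar entities and relationships" means in a graph schema: embeddings link query mentions to graph nodes, and evidence is then gathered by expanding over graph edges rather than by fetching raw text chunks. All names and vectors are illustrative stand-ins, not the repo's implementation:

    # Sketch: dense retrieval restricted to entity/relationship lookup.
    # Names and random vectors are illustrative stand-ins for a real embedding model.
    import numpy as np

    def cosine_top_k(query_vec, matrix, k=3):
        """Indices of the k rows in `matrix` most similar to `query_vec`."""
        sims = matrix @ query_vec / (
            np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        return np.argsort(-sims)[:k]

    # Entity names and embeddings would be built offline during graph construction.
    entity_names = ['mmdeploy', 'TensorRT', 'ONNX Runtime']
    entity_vecs = np.random.default_rng(0).normal(size=(3, 768))   # stand-in embeddings

    query_vec = np.random.default_rng(1).normal(size=768)          # stand-in for embed(query)
    for idx in cosine_top_k(query_vec, entity_vecs, k=2):
        print('candidate entity:', entity_names[idx])
    # Downstream retrieval then expands from the matched entities over relationships
    # (graph edges), instead of dense-retrieving answer passages directly.
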
πŸ€ Acknowledgements

  • SiliconCloud: abundant LLM APIs, some models are free
  • KAG: graph retrieval based on reasoning
  • DB-GPT: a collection of LLM tools
  • LightRAG: a simple and efficient graph retrieval solution

πŸ“ Citation

@misc{kong2024huixiangdou,
      title={HuiXiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance},
      author={Huanjun Kong and Songyang Zhang and Jiaying Li and Min Xiao and Jun Xu and Kai Chen},
      year={2024},
      eprint={2401.08772},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2401.08772},
}

@misc{kong2024labelingsupervisedfinetuningdata,
      title={Labeling supervised fine-tuning data with the scaling law}, 
      author={Huanjun Kong},
      year={2024},
      eprint={2405.02817},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2405.02817}, 
}

@misc{kong2025huixiangdou2robustlyoptimizedgraphrag,
      title={HuixiangDou2: A Robustly Optimized GraphRAG Approach}, 
      author={Huanjun Kong and Zhefan Wang and Chenyang Wang and Zhe Ma and Nanqing Dong},
      year={2025},
      eprint={2503.06474},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2503.06474}, 
}