Implementation of the paper *Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines*.
- 2025 Feb 19: Paper updated on arXiv. Code and other resources are released. Online demo is coming soon.
- 2024 Nov 25: Paper available on arXiv.
| Item | Repository |
|---|---|
| Benchmark Dataset | 🤗 ylwt/M2RAG-Bench |
| Training Dataset | 🤗 ylwt/M2RAG-Distill-GPT-4o |
| Distilled Model - Llama-3.1-8B-Instruct | 🤗 ylwt/M2RAG-Llama-3.1-8B-Instruct |
| Distilled Model - Qwen2.5-7B-Instruct | 🤗 ylwt/M2RAG-Qwen2.5-7B-Instruct |
| Distilled Model - Qwen2-VL-7B-Instruct | 🤗 ylwt/M2RAG-Qwen2-VL-7B-Instruct |
- Create and activate a Conda environment:

      conda create -n m2rag python=3.12 -y
      conda activate m2rag
- Clone the repository and install dependencies:

      git clone https://github.com/maziao/M2RAG.git
      cd M2RAG
      pip install -r requirements.txt
- To use our fine-tuned models or to fine-tune your own, install LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory).
- Configure environment variables. Edit the .env file first to fill in your API keys and other required details, then load it (a quick sanity check is sketched below):

      source .env
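The exact variable names depend on the .env template shipped with the repository. As a rough pre-flight check (the names below, e.g. `OPENAI_API_KEY`, are hypothetical placeholders), you can verify that the keys your configuration needs are actually exported:

```python
import os

# Hypothetical variable names for illustration only -- replace them with the
# keys your .env file actually defines (e.g. the credentials for the
# OpenAI-compatible API used by the summarizer and evaluator).
REQUIRED_VARS = ["OPENAI_API_KEY", "OPENAI_BASE_URL"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")
```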
- To start a new session:

      python cli_demo.py --query "How to fold a paper airplane step by step?"
- To resume from an interrupted session:

      python cli_demo.py --query "How to fold a paper airplane step by step?" --session-id "20250219-xxxxxx"
- Download the benchmark dataset:

      bash scripts/download_benchmark_dataset.sh
- Generate M$^2$RAG results. Customize your configuration for LLM or MLLM in ./src/config/, then run (a config sanity check is sketched below):

      python summarize.py --config-file ./src/config/summarize_custom_mllm.yaml
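If you edit the YAML yourself, a quick parse can catch syntax errors before launching a long run. This is only a sanity-check sketch (it assumes PyYAML is installed; the valid keys are defined by the config files shipped in ./src/config/):

```python
import yaml  # requires PyYAML

# Parse the customized config to catch YAML syntax errors before a long run.
# The path matches the command above; the valid keys are defined by the
# reference configs shipped in ./src/config/.
with open("./src/config/summarize_custom_mllm.yaml") as f:
    config = yaml.safe_load(f)

print(yaml.safe_dump(config, sort_keys=False, allow_unicode=True))
```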
- Evaluate the results. You can customize the evaluation configuration in ./src/config/evaluate_custom.yaml:

      python evaluate.py \
          --summarize-log-dir ./log/summarize/summarizer-single_stage-vlm-openai-gpt-4o-2024-08-06 \
          --config-file ./src/config/evaluate_custom.yaml
Take Llama-3.1-8B-Instruct as an example:
- Download the base model checkpoint (a Python alternative is sketched below):

  Note: You need to agree to share your contact information to access Llama-3.1-8B-Instruct.

      # If the checkpoint has not been downloaded yet:
      mkdir -p models/Llama-3.1-8B-Instruct/original
      huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
          --local-dir models/Llama-3.1-8B-Instruct/original

      # If the checkpoint has already been downloaded elsewhere:
      mkdir -p models/Llama-3.1-8B-Instruct
      ln -s PATH_TO_CHECKPOINT models/Llama-3.1-8B-Instruct/original
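If you prefer scripting the download, the same checkpoint can be fetched with the huggingface_hub Python API. This is a sketch of an equivalent call; as with the CLI, it requires being logged in to Hugging Face and having accepted the model's terms:

```python
from huggingface_hub import snapshot_download

# Equivalent of the huggingface-cli command above: download the gated base
# checkpoint into the directory layout expected by the merge and deploy scripts.
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="models/Llama-3.1-8B-Instruct/original",
)
```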
- Download the LoRA adapter:

      mkdir -p models/Llama-3.1-8B-Instruct/LoRA
      huggingface-cli download ylwt/M2RAG-Llama-3.1-8B-Instruct \
          --local-dir models/Llama-3.1-8B-Instruct/LoRA
- Merge the LoRA adapter with the base checkpoint (a peft-based sketch of what this does follows below):

      bash finetune/scripts/llama_factory_merge_lora.sh finetune/config/merge/llama_3_1_merge_lora.yaml 0
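The script above performs the merge via LLaMA-Factory. For reference only, a rough peft-based equivalent looks like the sketch below; the output directory is a placeholder, since the script's actual output path is set in its YAML config:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base checkpoint and attach the downloaded LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "models/Llama-3.1-8B-Instruct/original", torch_dtype="auto"
)
model = PeftModel.from_pretrained(base, "models/Llama-3.1-8B-Instruct/LoRA")

# Fold the adapter weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("models/Llama-3.1-8B-Instruct/merged")  # placeholder output path
tokenizer = AutoTokenizer.from_pretrained("models/Llama-3.1-8B-Instruct/original")
tokenizer.save_pretrained("models/Llama-3.1-8B-Instruct/merged")
```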
- Deploy with vLLM (an example client request is sketched below):

      bash scripts/vllm/deploy_llama_3_1.sh
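Once the server is running, it should expose vLLM's OpenAI-compatible API. The sketch below assumes the default port 8000 and uses a placeholder served-model name; check scripts/vllm/deploy_llama_3_1.sh for the values it actually configures:

```python
from openai import OpenAI

# Talk to vLLM's OpenAI-compatible endpoint. The port and model name are
# assumptions; adjust them to match the deployment script.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="M2RAG-Llama-3.1-8B-Instruct",  # placeholder served-model name
    messages=[{"role": "user", "content": "How to fold a paper airplane step by step?"}],
)
print(response.choices[0].message.content)
```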
Take fine-tuning Llama-3.1-8B-Instruct on 2 × NVIDIA A100-SXM4-80GB GPUs as an example. You can modify the configuration in finetune/config/sft/ to suit different setups. Be sure to prepare the base models following the instructions in the previous section.
- Download the training dataset:

      bash scripts/download_training_dataset.sh
- Convert the raw dataset to ShareGPT format (an illustration of the record layout follows below):

      bash finetune/scripts/prepare_llama_factory_dataset.sh
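For orientation, ShareGPT-style records (as consumed by LLaMA-Factory) store a conversation as alternating turns with `from`/`value` fields. The snippet below is a minimal illustration, not the exact output of the conversion script, which may add further fields such as image paths for multi-modal samples:

```python
import json

# Minimal ShareGPT-style record: alternating human/gpt turns. The actual
# converted dataset may include additional fields (e.g. image paths).
example = {
    "conversations": [
        {"from": "human", "value": "How to fold a paper airplane step by step?"},
        {"from": "gpt", "value": "1. Fold the sheet in half lengthwise. 2. ..."},
    ]
}
print(json.dumps(example, indent=2, ensure_ascii=False))
```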
- Start fine-tuning:

      bash finetune/scripts/llama_factory_finetune.sh finetune/config/sft/llama_3_1_lora_sft_ds2.yaml 2 0,1
For detailed instructions on reproducing our work, please refer to this guide.
If you find this repository useful for your research, please cite our paper:
@misc{ma2025multimodalretrievalaugmentedmultimodal,
title={Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines},
author={Zi-Ao Ma and Tian Lan and Rong-Cheng Tu and Yong Hu and Yu-Shi Zhu and Tong Zhang and Heyan Huang and Xian-Ling Mao},
year={2025},
eprint={2411.16365},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.16365},
}