Implementation of the paper *Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines*.
- 2025 Feb 19: Paper updated on arXiv. Code and other resources are released. Online demo is coming soon.
- 2024 Nov 25: Paper available on arXiv.
| Item | Repository |
|---|---|
| Benchmark Dataset | 🤗 ylwt/M2RAG-Bench |
| Training Dataset | 🤗 ylwt/M2RAG-Distill-GPT-4o |
| Distilled Model - Llama-3.1-8B-Instruct | 🤗 ylwt/M2RAG-Llama-3.1-8B-Instruct |
| Distilled Model - Qwen2.5-7B-Instruct | 🤗 ylwt/M2RAG-Qwen2.5-7B-Instruct |
| Distilled Model - Qwen2-VL-7B-Instruct | 🤗 ylwt/M2RAG-Qwen2-VL-7B-Instruct |
- Create and activate a Conda environment:

      conda create -n m2rag python=3.12 -y
      conda activate m2rag
- Clone the repository and install dependencies:

      git clone https://github.com/maziao/M2RAG.git
      cd M2RAG
      pip install -r requirements.txt
- To use our fine-tuned models or to fine-tune your own, install LLaMA-Factory (https://github.com/hiyouga/LLaMA-Factory).
- Configure environment variables. Edit the .env file first to fill in your API keys and other required details, then load it (a quick sanity check is sketched below):

      source .env
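The exact variable names depend on the .env template shipped with the repository. As a rough pre-flight check (the names below, e.g. `OPENAI_API_KEY`, are hypothetical placeholders), you can verify that the keys your configuration needs are actually exported:

```python
import os

# Hypothetical variable names for illustration only -- replace them with the
# keys your .env file actually defines (e.g. the credentials for the
# OpenAI-compatible API used by the summarizer and evaluator).
REQUIRED_VARS = ["OPENAI_API_KEY", "OPENAI_BASE_URL"]

missing = [name for name in REQUIRED_VARS if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")
```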
- To start a new session:

      python cli_demo.py --query "How to fold a paper airplane step by step?"
- To resume from an interrupted session:

      python cli_demo.py --query "How to fold a paper airplane step by step?" --session-id "20250219-xxxxxx"
- Download the benchmark dataset:

      bash scripts/download_benchmark_dataset.sh
- Generate M$^2$RAG results. Customize your configuration for LLM or MLLM in ./src/config/, then run (a config sanity check is sketched below):

      python summarize.py --config-file ./src/config/summarize_custom_mllm.yaml
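If you edit the YAML yourself, a quick parse can catch syntax errors before launching a long run. This is only a sanity-check sketch (it assumes PyYAML is installed; the valid keys are defined by the config files shipped in ./src/config/):

```python
import yaml  # requires PyYAML

# Parse the customized config to catch YAML syntax errors before a long run.
# The path matches the command above; the valid keys are defined by the
# reference configs shipped in ./src/config/.
with open("./src/config/summarize_custom_mllm.yaml") as f:
    config = yaml.safe_load(f)

print(yaml.safe_dump(config, sort_keys=False, allow_unicode=True))
```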
- Evaluate the results. You can customize the evaluation configuration in ./src/config/evaluate_custom.yaml:

      python evaluate.py \
          --summarize-log-dir ./log/summarize/summarizer-single_stage-vlm-openai-gpt-4o-2024-08-06 \
          --config-file ./src/config/evaluate_custom.yaml
Take Llama-3.1-8B-Instruct as an example:
- Download the base model checkpoint (a Python alternative is sketched below):

  Note: You need to agree to share your contact information to access Llama-3.1-8B-Instruct.

      # If the checkpoint has not been downloaded yet:
      mkdir -p models/Llama-3.1-8B-Instruct/original
      huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
          --local-dir models/Llama-3.1-8B-Instruct/original

      # If the checkpoint has already been downloaded elsewhere:
      mkdir -p models/Llama-3.1-8B-Instruct
      ln -s PATH_TO_CHECKPOINT models/Llama-3.1-8B-Instruct/original
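If you prefer scripting the download, the same checkpoint can be fetched with the huggingface_hub Python API. This is a sketch of an equivalent call; as with the CLI, it requires being logged in to Hugging Face and having accepted the model's terms:

```python
from huggingface_hub import snapshot_download

# Equivalent of the huggingface-cli command above: download the gated base
# checkpoint into the directory layout expected by the merge and deploy scripts.
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="models/Llama-3.1-8B-Instruct/original",
)
```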
- Download the LoRA adapter:

      mkdir -p models/Llama-3.1-8B-Instruct/LoRA
      huggingface-cli download ylwt/M2RAG-Llama-3.1-8B-Instruct \
          --local-dir models/Llama-3.1-8B-Instruct/LoRA
- Merge the LoRA adapter with the base checkpoint (a peft-based sketch of what this does follows below):

      bash finetune/scripts/llama_factory_merge_lora.sh finetune/config/merge/llama_3_1_merge_lora.yaml 0
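The script above performs the merge via LLaMA-Factory. For reference only, a rough peft-based equivalent looks like the sketch below; the output directory is a placeholder, since the script's actual output path is set in its YAML config:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base checkpoint and attach the downloaded LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "models/Llama-3.1-8B-Instruct/original", torch_dtype="auto"
)
model = PeftModel.from_pretrained(base, "models/Llama-3.1-8B-Instruct/LoRA")

# Fold the adapter weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()
merged.save_pretrained("models/Llama-3.1-8B-Instruct/merged")  # placeholder output path
tokenizer = AutoTokenizer.from_pretrained("models/Llama-3.1-8B-Instruct/original")
tokenizer.save_pretrained("models/Llama-3.1-8B-Instruct/merged")
```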
- Deploy with vLLM (an example client request is sketched below):

      bash scripts/vllm/deploy_llama_3_1.sh
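Once the server is running, it should expose vLLM's OpenAI-compatible API. The sketch below assumes the default port 8000 and uses a placeholder served-model name; check scripts/vllm/deploy_llama_3_1.sh for the values it actually configures:

```python
from openai import OpenAI

# Talk to vLLM's OpenAI-compatible endpoint. The port and model name are
# assumptions; adjust them to match the deployment script.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="M2RAG-Llama-3.1-8B-Instruct",  # placeholder served-model name
    messages=[{"role": "user", "content": "How to fold a paper airplane step by step?"}],
)
print(response.choices[0].message.content)
```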
Take fine-tuning Llama-3.1-8B-Instruct on 2 × NVIDIA A100-SXM4-80GB GPUs as an example. You can modify the configuration in finetune/config/sft/ to suit different setups. Be sure to prepare the base models following the instructions in the previous section.
- Download the training dataset:

      bash scripts/download_training_dataset.sh
- Convert the raw dataset to ShareGPT format (an illustration of the record layout follows below):

      bash finetune/scripts/prepare_llama_factory_dataset.sh
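For orientation, ShareGPT-style records (as consumed by LLaMA-Factory) store a conversation as alternating turns with `from`/`value` fields. The snippet below is a minimal illustration, not the exact output of the conversion script, which may add further fields such as image paths for multi-modal samples:

```python
import json

# Minimal ShareGPT-style record: alternating human/gpt turns. The actual
# converted dataset may include additional fields (e.g. image paths).
example = {
    "conversations": [
        {"from": "human", "value": "How to fold a paper airplane step by step?"},
        {"from": "gpt", "value": "1. Fold the sheet in half lengthwise. 2. ..."},
    ]
}
print(json.dumps(example, indent=2, ensure_ascii=False))
```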
- Start fine-tuning:

      bash finetune/scripts/llama_factory_finetune.sh finetune/config/sft/llama_3_1_lora_sft_ds2.yaml 2 0,1
For detailed instructions on reproducing our work, please refer to this guide.
If you find this repository useful for your research, please cite our paper:
@misc{ma2025multimodalretrievalaugmentedmultimodal,
title={Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines},
author={Zi-Ao Ma and Tian Lan and Rong-Cheng Tu and Yong Hu and Yu-Shi Zhu and Tong Zhang and Heyan Huang and Xian-Ling Mao},
year={2025},
eprint={2411.16365},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2411.16365},
}