
Multi-modal Retrieval Augmented Multi-modal Generation


Implementation of paper Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines

[Figure: Illustration of the M$^2$RAG framework]

🔥 News

  • 2025 Feb 19: Paper updated on arXiv. Code and other resources are released. Online demo is coming soon.
  • 2024 Nov 25: Paper available on arXiv.

📋 Table of Contents

  • 🤗 Resources
  • 🚀 Getting Started
  • 📈 Evaluation on M$^2$RAG-Bench
  • How to Use Our Fine-tuned Models
  • 🔧 Fine-tuning
  • ♻️ Reproduce Our Work
  • 📎 Citation

🤗 Resources

Item                                       Repository
Benchmark Dataset                          🤗 ylwt/M2RAG-Bench
Training Dataset                           🤗 ylwt/M2RAG-Distill-GPT-4o
Distilled Model - Llama-3.1-8B-Instruct    🤗 ylwt/M2RAG-Llama-3.1-8B-Instruct
Distilled Model - Qwen2.5-7B-Instruct      🤗 ylwt/M2RAG-Qwen2.5-7B-Instruct
Distilled Model - Qwen2-VL-7B-Instruct     🤗 ylwt/M2RAG-Qwen2-VL-7B-Instruct

🚀 Getting Started

🛠️ Installation

  1. Create and activate a Conda environment:

    conda create -n m2rag python=3.12 -y
    conda activate m2rag
  2. Clone the repository and install dependencies:

    git clone https://github.com/maziao/M2RAG.git
    cd M2RAG
    pip install -r requirements.txt
  3. To use our fine-tuned models or fine-tune your own, install LLaMA-Factory.

  4. Configure environment variables (make sure to modify the .env file first to fill in your API keys and other necessary details):

    source .env
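
For reference, a .env file sourced this way is simply a list of exported shell variables. The snippet below is only an illustrative sketch: the variable names shown are assumptions, so fill in whatever keys the repository's .env template actually asks for.

    # Illustrative sketch only -- variable names here are assumptions;
    # use the keys listed in the repository's .env template.
    export OPENAI_API_KEY="sk-..."                       # key for the OpenAI models used by the pipeline
    export OPENAI_BASE_URL="https://api.openai.com/v1"   # override if you route through a proxy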

CLI Demo

  • To start a new session:

    python cli_demo.py --query "How to fold a paper airplane step by step?"
  • To resume from an interrupted session:

    python cli_demo.py --query "How to fold a paper airplane step by step?" --session-id "20250219-xxxxxx"

📈 Evaluation on M$^2$RAG-Bench

  1. Download the benchmark dataset:

    bash scripts/download_benchmark_dataset.sh
  2. Generate M$^2$RAG results:

    Customize your configuration for the LLM or MLLM summarizer (an illustrative config sketch follows this list):

    python summarize.py --config-file ./src/config/summarize_custom_mllm.yaml
  3. Evaluate the results:

    You can customize the evaluation configuration in ./src/config/evaluate_custom.yaml.

    python evaluate.py \
      --summarize-log-dir ./log/summarize/summarizer-single_stage-vlm-openai-gpt-4o-2024-08-06 \
      --config-file ./src/config/evaluate_custom.yaml
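
The actual schema of summarize_custom_mllm.yaml is defined by the repository's config files, so the sketch below is purely illustrative: every key name is a hypothetical placeholder for the kind of settings a custom MLLM summarizer configuration typically needs (backbone model, API credentials, limits on retrieved content).

    # Hypothetical sketch -- key names are placeholders, not the repo's actual schema.
    summarizer:
      type: single_stage-vlm         # single-stage MLLM summarization (cf. the log dir name above)
      model: gpt-4o-2024-08-06       # backbone MLLM to call
      api_key: ${OPENAI_API_KEY}     # taken from the environment configured via .env
      max_input_images: 10           # cap on retrieved images passed to the model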

How to Use Our Fine-tuned Models

Take Llama-3.1-8B-Instruct as an example:

  1. Download the base model checkpoint:

    Note: You need to agree to share your contact information to access Llama-3.1-8B-Instruct.

    # If the checkpoint has not been downloaded:
    mkdir -p models/Llama-3.1-8B-Instruct/original
    huggingface-cli download meta-llama/Llama-3.1-8B-Instruct \
      --local-dir models/Llama-3.1-8B-Instruct/original
    
    # If the checkpoint has already been downloaded:
    mkdir -p models/Llama-3.1-8B-Instruct
    ln -s PATH_TO_CHECKPOINT models/Llama-3.1-8B-Instruct/original
  2. Download the LoRA adapter:

    mkdir -p models/Llama-3.1-8B-Instruct/LoRA
    huggingface-cli download ylwt/M2RAG-Llama-3.1-8B-Instruct \
      --local-dir models/Llama-3.1-8B-Instruct/LoRA
  3. Merge the LoRA adapter with the base checkpoint:

    bash finetune/scripts/llama_factory_merge_lora.sh finetune/config/merge/llama_3_1_merge_lora.yaml 0
  4. Deploy with vLLM (a minimal client sketch for querying the deployed model follows this list):

    bash scripts/vllm/deploy_llama_3_1.sh
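
Once the script is running, vLLM exposes an OpenAI-compatible endpoint, so any OpenAI-style client can query the merged model. The sketch below assumes the server listens on localhost:8000 and that the served model id matches the merged checkpoint path; both are assumptions, so adjust them to whatever scripts/vllm/deploy_llama_3_1.sh actually configures.

    # Minimal client sketch (openai>=1.0). Host, port and model id are assumptions --
    # check scripts/vllm/deploy_llama_3_1.sh for the actual values.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="models/Llama-3.1-8B-Instruct/merged",  # assumed served model id
        messages=[{"role": "user", "content": "How to fold a paper airplane step by step?"}],
        temperature=0.7,
    )
    print(response.choices[0].message.content)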

🔧 Fine-tuning

Take fine-tuning Llama-3.1-8B-Instruct on 2 × NVIDIA A100-SXM4-80GB GPUs as an example. You can modify the fine-tuning configuration (e.g. finetune/config/sft/llama_3_1_lora_sft_ds2.yaml) to suit different setups. Be sure to prepare the base models following the instructions in the previous section.

  1. Download the training dataset:

    bash scripts/download_training_dataset.sh
  2. Convert the raw dataset to ShareGPT format (an illustrative sample is shown after this list):

    bash finetune/scripts/prepare_llama_factory_dataset.sh
  3. Start fine-tuning:

    bash finetune/scripts/llama_factory_finetune.sh finetune/config/sft/llama_3_1_lora_sft_ds2.yaml 2 0,1
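
For orientation, ShareGPT-format data (as consumed by LLaMA-Factory) stores each training example as a list of alternating human/assistant turns. The sample below is only a generic illustration with made-up content; the exact fields emitted by prepare_llama_factory_dataset.sh may differ.

    [
      {
        "conversations": [
          {
            "from": "human",
            "value": "<query plus retrieved multi-modal context>"
          },
          {
            "from": "gpt",
            "value": "<distilled GPT-4o multi-modal answer>"
          }
        ]
      }
    ]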

♻️ Reproduce Our Work

For detailed instructions on reproducing our work, please refer to this guide.

📎 Citation

If you find this repository useful for your research, please cite our paper:

@misc{ma2025multimodalretrievalaugmentedmultimodal,
      title={Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines}, 
      author={Zi-Ao Ma and Tian Lan and Rong-Cheng Tu and Yong Hu and Yu-Shi Zhu and Tong Zhang and Heyan Huang and Xian-Ling Mao},
      year={2025},
      eprint={2411.16365},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.16365}, 
}
