
mmMamba

Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Bencheng Liao1,2,*, Hongyuan Tao2,*, Qian Zhang3, Tianheng Cheng2, Yingyue Li2, Haoran Yin3, Wenyu Liu2, Xinggang Wang2 📧

1 Institute of Artificial Intelligence, HUST, 2 School of EIC, HUST, 3 Horizon Robotics

* equal contribution, 📧 corresponding author, xgwang@hust.edu.cn


News

  • Feb. 19th, 2025: We released our paper on arXiv, along with the initial version of the code and weights.


Introduction

We propose mmMamba, the first decoder-only multimodal state space model, obtained through quadratic-to-linear distillation with moderate academic computing resources. Unlike existing linear-complexity encoder-based multimodal large language models (MLLMs), mmMamba eliminates the need for separate vision encoders and for underperforming pre-trained RNN-based LLMs. Through our seeding strategy and three-stage progressive distillation recipe, mmMamba effectively transfers knowledge from quadratic-complexity decoder-only pre-trained MLLMs while preserving multimodal capabilities. Additionally, mmMamba supports flexible hybrid architectures that strategically combine Transformer and Mamba layers, enabling customizable trade-offs between computational efficiency and model performance.
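To make the hybrid idea concrete, here is a minimal illustrative sketch (not the actual mmMamba implementation; `LinearBlock`, `build_hybrid_stack`, and `attn_every` are hypothetical names) of interleaving quadratic attention layers with linear-complexity blocks at a configurable ratio:

```python
import torch.nn as nn

class LinearBlock(nn.Module):
    """Stand-in for a Mamba-2 (linear-complexity) block; a real model
    would use an actual Mamba-2 layer, e.g. from the mamba/fla libraries."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Linear(dim, dim)  # placeholder token mixer
    def forward(self, x):
        return x + self.mix(x)

def build_hybrid_stack(dim, depth, attn_every=4):
    """Interleave one attention (quadratic) layer per `attn_every` layers,
    with linear blocks elsewhere: the efficiency/performance trade-off knob."""
    layers = []
    for i in range(depth):
        if attn_every and (i + 1) % attn_every == 0:
            layers.append(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True))
        else:
            layers.append(LinearBlock(dim))
    return nn.Sequential(*layers)
```

Setting `attn_every=0` yields a pure linear stack (the mmMamba-linear regime), while smaller values of `attn_every` trade speed for accuracy (the mmMamba-hybrid regime).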

Distilled from the decoder-only HoVLE-2.6B, our pure Mamba-2-based mmMamba-linear achieves performance competitive with existing linear- and quadratic-complexity VLMs, including models with 2× more parameters such as EVE-7B. The hybrid variant, mmMamba-hybrid, further improves performance across all benchmarks, approaching the capabilities of the teacher model HoVLE. In long-context scenarios with 103K tokens, mmMamba-linear demonstrates remarkable efficiency gains, with a 20.6× speedup and a 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5× speedup and 60.2% memory savings.

Figure: Seeding strategy and three-stage distillation pipeline of mmMamba.
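As a rough conceptual sketch of quadratic-to-linear distillation (not the authors' exact recipe; the function names and loss choices below are assumptions), one can picture first matching each student Mamba-2 layer to its frozen attention counterpart, then aligning the full models at the logit level:

```python
import torch
import torch.nn.functional as F

def layerwise_loss(teacher_layer, student_layer, hidden_states):
    """Layer-wise distillation: the linear student layer learns to
    reproduce the frozen quadratic attention layer's output."""
    with torch.no_grad():
        target = teacher_layer(hidden_states)
    return F.mse_loss(student_layer(hidden_states), target)

def logit_loss(teacher_logits, student_logits, temperature=2.0):
    """End-to-end distillation: KL divergence between softened predictions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t
```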

Getting Started
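A minimal, hypothetical usage sketch, assuming the released weights follow the standard Hugging Face `transformers` interface; the repo id below is an assumption, so consult the released weights linked above for the official entry point:

```python
from transformers import AutoModel, AutoTokenizer

repo_id = "hustvl/mmMamba-linear"  # assumed repo id; check the released weights

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,  # loads the custom mmMamba modeling code, if provided
    torch_dtype="auto",
).eval()
```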

Acknowledgement

mmMamba is greatly inspired by the following outstanding contributions to the open-source community: mamba, LolCATs, phi-mamba, MambaInLlama, HoVLE, SOLO, flash-linear-attention.

Citation

If you find mmMamba useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```bibtex
@article{mmMamba,
  title={mmMamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation},
  author={Bencheng Liao and Hongyuan Tao and Qian Zhang and Tianheng Cheng and Yingyue Li and Haoran Yin and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2502.13145},
  year={2025}
}
```
