
mmMamba

Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation

Bencheng Liao1,2,*, Hongyuan Tao2,*, Qian Zhang3, Tianheng Cheng2, Yingyue Li2, Haoran Yin3, Wenyu Liu2, Xinggang Wang2 📧

1 Institute of Artificial Intelligence, HUST, 2 School of EIC, HUST, 3 Horizon Robotics

* equal contribution, 📧 corresponding author, xgwang@hust.edu.cn


News

  • Feb. 19th, 2025: We released our paper on arXiv, along with the initial version of the code and weights.


Introduction

We propose mmMamba, the first decoder-only multimodal state space model, obtained through quadratic-to-linear distillation with moderate academic computing resources. Unlike existing linear-complexity encoder-based multimodal large language models (MLLMs), mmMamba eliminates the need for separate vision encoders and for underperforming pre-trained RNN-based LLMs. Through our seeding strategy and three-stage progressive distillation recipe, mmMamba effectively transfers knowledge from quadratic-complexity decoder-only pre-trained MLLMs while preserving multimodal capabilities. Additionally, mmMamba supports flexible hybrid architectures that strategically combine Transformer and Mamba layers, enabling customizable trade-offs between computational efficiency and model performance.
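To make the hybrid idea concrete, here is a minimal illustrative sketch (not the actual mmMamba implementation; `LinearBlock`, `build_hybrid_stack`, and `attn_every` are hypothetical names) of interleaving quadratic attention layers with linear-complexity blocks at a configurable ratio:

```python
import torch.nn as nn

class LinearBlock(nn.Module):
    """Stand-in for a Mamba-2 (linear-complexity) block; a real model
    would use an actual Mamba-2 layer, e.g. from the mamba/fla libraries."""
    def __init__(self, dim):
        super().__init__()
        self.mix = nn.Linear(dim, dim)  # placeholder token mixer
    def forward(self, x):
        return x + self.mix(x)

def build_hybrid_stack(dim, depth, attn_every=4):
    """Interleave one attention (quadratic) layer per `attn_every` layers,
    with linear blocks elsewhere: the efficiency/performance trade-off knob."""
    layers = []
    for i in range(depth):
        if attn_every and (i + 1) % attn_every == 0:
            layers.append(nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True))
        else:
            layers.append(LinearBlock(dim))
    return nn.Sequential(*layers)
```

Setting `attn_every=0` yields a pure linear stack (the mmMamba-linear regime), while smaller values of `attn_every` trade speed for accuracy (the mmMamba-hybrid regime).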

Distilled from the decoder-only HoVLE-2.6B, our pure Mamba-2-based mmMamba-linear achieves performance competitive with existing linear- and quadratic-complexity VLMs, including models with 2× more parameters such as EVE-7B. The hybrid variant, mmMamba-hybrid, further improves performance across all benchmarks, approaching the capabilities of the teacher model HoVLE. In long-context scenarios with 103K tokens, mmMamba-linear demonstrates remarkable efficiency gains, with a 20.6× speedup and a 75.8% GPU memory reduction compared to HoVLE, while mmMamba-hybrid achieves a 13.5× speedup and 60.2% memory savings.

Figure: Seeding strategy and three-stage distillation pipeline of mmMamba.
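As a rough conceptual sketch of quadratic-to-linear distillation (not the authors' exact recipe; the function names and loss choices below are assumptions), one can picture first matching each student Mamba-2 layer to its frozen attention counterpart, then aligning the full models at the logit level:

```python
import torch
import torch.nn.functional as F

def layerwise_loss(teacher_layer, student_layer, hidden_states):
    """Layer-wise distillation: the linear student layer learns to
    reproduce the frozen quadratic attention layer's output."""
    with torch.no_grad():
        target = teacher_layer(hidden_states)
    return F.mse_loss(student_layer(hidden_states), target)

def logit_loss(teacher_logits, student_logits, temperature=2.0):
    """End-to-end distillation: KL divergence between softened predictions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t * t
```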

Getting Started
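A minimal, hypothetical usage sketch, assuming the released weights follow the standard Hugging Face `transformers` interface; the repo id below is an assumption, so consult the released weights linked above for the official entry point:

```python
from transformers import AutoModel, AutoTokenizer

repo_id = "hustvl/mmMamba-linear"  # assumed repo id; check the released weights

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    trust_remote_code=True,  # loads the custom mmMamba modeling code, if provided
    torch_dtype="auto",
).eval()
```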

Acknowledgement

mmMamba is greatly inspired by the following outstanding contributions to the open-source community: mamba, LolCATs, phi-mamba, MambaInLlama, HoVLE, SOLO, flash-linear-attention.

Citation

If you find mmMamba useful in your research or applications, please consider giving us a star 🌟 and citing it with the following BibTeX entry.

```bibtex
@article{mmMamba,
  title={mmMamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation},
  author={Bencheng Liao and Hongyuan Tao and Qian Zhang and Tianheng Cheng and Yingyue Li and Haoran Yin and Wenyu Liu and Xinggang Wang},
  journal={arXiv preprint arXiv:2502.13145},
  year={2025}
}
```
