- [2025-01] 🎉 Our arXiv paper TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding is released!
- [2024-12] 🔊 Our TinyLLaVA-Video-v1 repository has been established.
This is a framework of Small-scale Large Multimodal Models for video understanding based on TinyLLaVA_Factory.
- The model has no more than 4B parameters and processes video sequences in a simple manner, without the need for complex architectures, supporting both fps sampling and uniform frame sampling.
- We validate the effectiveness of this framework through experiments; the best model achieves performance comparable to certain existing 7B models on multiple video understanding benchmarks.
- Clone this repository and navigate to the folder:

```bash
git clone https://github.com/ZhangXJ199/TinyLLaVA-Video.git
cd TinyLLaVA-Video
```

- Create a conda environment, activate it, and install the packages:

```bash
conda create -n tinyllava_video python=3.10 -y
conda activate tinyllava_video
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

- Install additional packages:

```bash
pip install flash-attn --no-build-isolation
```
- Upgrade to the latest code base:

```bash
git pull
pip install -e .
```
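As an optional sanity check, you can confirm that the key dependencies import cleanly after installation; `tinyllava` is assumed here to be the package name installed by `pip install -e .`.

```bash
# optional: verify that the core dependencies are importable
python -c "import torch; print('torch', torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn OK')"
python -c "import tinyllava; print('tinyllava OK')"  # assumed package name from `pip install -e .`
```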
Our training data combines subsets of two datasets: LLaVA-Video-178K and Valley.
Stage | Source | #Sample |
---|---|---|
Pretrain | LLaVA-Video-178K + Valley | 397k |
Finetune | LLaVA-Video-178K | 491k |
For pretraining, we use four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`, supplemented with the filtered Valley data. The organized pretraining annotations can be downloaded from here.
For finetuning, we use the same four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`. The organized finetuning annotations can be downloaded from here.
Organize the video files and annotation files as follows in `path/to/your/dataset`:

```
dataset
├── academic_source
├── liwei_youtube_videos
├── valley
├── text_files
│   ├── cleaned_video_caption.json
│   ├── cleaned_video_openqa.json
```
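Before training, a quick sanity check of the layout can catch missing files early; the sketch below only looks for the directories and annotation files listed above, and the dataset path is a placeholder.

```bash
DATA_ROOT=path/to/your/dataset   # placeholder: replace with your dataset root
for d in academic_source liwei_youtube_videos valley text_files; do
  [ -d "$DATA_ROOT/$d" ] || echo "missing directory: $DATA_ROOT/$d"
done
for f in cleaned_video_caption.json cleaned_video_openqa.json; do
  [ -f "$DATA_ROOT/text_files/$f" ] || echo "missing annotation file: $DATA_ROOT/text_files/$f"
done
```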
You can refer to TinyLLaVA_Factory to modify components such as `llm`, `vision_tower`, and `train_recipe`. Here is an example of training an LMM using Qwen2.5.
- Replace the data paths with yours in `scripts/train/qwen2/train_qwen2_base_video.sh`.
- Replace `output_dir` with yours in `scripts/train/qwen2/pretrain_qwen2_video.sh`.
- Replace `pretrained_model_path` and `output_dir` with yours in `scripts/train/qwen2/finetune_qwen2_video.sh`.
- Adjust your GPU ids (localhost) and `per_device_train_batch_size` in `scripts/train/qwen2/pretrain_qwen2_video.sh` and `scripts/train/qwen2/finetune_qwen2_video.sh` (an example sketch of these settings follows the launch command below).
```bash
bash scripts/train/qwen2/train_qwen2_base_video.sh
```
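For reference, a minimal sketch of what the edited settings in `pretrain_qwen2_video.sh` might look like; only the names mentioned in the steps above are assumed to appear in the scripts, and every value is a placeholder for your own environment.

```bash
# hypothetical edits inside pretrain_qwen2_video.sh -- values are placeholders
output_dir=path/to/your/checkpoints/pretrain   # where pretraining checkpoints are written
per_device_train_batch_size=4                  # lower this if you run out of GPU memory
# GPU ids are chosen in the launcher line, e.g. a DeepSpeed-style `--include localhost:0,1,2,3`
```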
Important hyperparameters used in pretraining and finetuning are provided below.
Training Stage | Global Batch Size | Learning rate | conv_version |
---|---|---|---|
Pretraining | 128 | 1e-4 | pretrain |
Finetuning | 64 | 2e-5 | qwen2_base |
Tips:

Global Batch Size = number of GPUs × `per_device_train_batch_size` × `gradient_accumulation_steps`. We recommend you always keep the global batch size and learning rate as above, except when LoRA-tuning your model.
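For example, one split that reaches the pretraining global batch size of 128 is 8 GPUs × a `per_device_train_batch_size` of 4 × a `gradient_accumulation_steps` of 4 = 128; how you divide the last two factors depends on your GPU memory.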
We currently provide evaluations on four benchmarks: Video-MME, MVBench, LongVideoBench, and MLVU.
- Download Video-MME and put it under `path/to/your/dataset/eval/Video-MME`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, `conv-mode`, and `duration` in `scripts/eval/videomme.sh`. There are three types of `duration` available for testing: `short`, `medium`, and `long` (an example sketch of these settings follows the command below).
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/videomme.sh
```
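As a rough illustration, the settings inside `scripts/eval/videomme.sh` might look like this; every value is a placeholder, and only the variable names from the step above are assumed to exist in the script.

```bash
# hypothetical values -- replace with your own checkpoint, run label, and dataset root
MODEL_PATH=path/to/your/checkpoints/finetune    # the TinyLLaVA-Video checkpoint to evaluate
MODEL_NAME=tinyllava-video-qwen2.5-3b           # a label for this evaluation run (placeholder)
EVAL_DIR=path/to/your/dataset/eval/Video-MME    # where you placed the benchmark
# conv-mode and duration are set in the same script, e.g. conv-mode=qwen2_base, duration=short
```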
- Download MVBench and put it under `path/to/your/dataset/eval/MVBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/mvbench.sh`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mvbench.sh
```
- Download LongVideoBench and put it under `path/to/your/dataset/eval/LongVideoBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/lvbench.sh`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/lvbench.sh
```
- Download MLVU and put it under `path/to/your/dataset/eval/MLVU`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/mlvu.sh`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mlvu.sh
```
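If you want to run several of the benchmarks back to back on one GPU, a simple loop over the evaluation scripts above works; this is only a convenience sketch, and each script still needs its variables edited first.

```bash
# run the MVBench, LongVideoBench, and MLVU scripts sequentially on GPU 0
for script in mvbench lvbench mlvu; do
  CUDA_VISIBLE_DEVICES=0 bash "scripts/eval/${script}.sh"
done
```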
In the #Frame/Query column below, 16 means sampling 16 frames, and 512 means using 512 tokens (queries) to represent the video sequence.
VT (HF Path) | LLM (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU |
---|---|---|---|---|---|---|
google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 1fps/1024 | 44.6 | 40.4 | 35.3 | 45.9 |
google/siglip-so400m-patch14-384 | microsoft/phi-2 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |
google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-1.5B | 16/512 | 34.4 | 39.0 | 29.5 | 40.5 |
- Please change `model_path`, `prompt`, `video_file`, and `conv-mode` in `eval.py`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 python eval.py
```
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
```bibtex
@article{zhang2025tinyllava,
  title={TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding},
  author={Zhang, Xingjian and Weng, Xi and Yue, Yihao and Fan, Zhaoxin and Wu, Wenjun and Huang, Lei},
  journal={arXiv preprint arXiv:2501.15513},
  year={2025}
}
```

```bibtex
@article{jia2024tinyllava,
  title={TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models},
  author={Jia, Junlong and Hu, Ying and Weng, Xi and Shi, Yiming and Li, Miao and Zhang, Xingjian and Zhou, Baichuan and Liu, Ziyu and Luo, Jie and Huang, Lei and Wu, Ji},
  journal={arXiv preprint arXiv:2405.11788},
  year={2024}
}
```
- This repository is based on the TinyLLaVA_Factory project.
- Our codebase is built upon the LLaVA project. Great work!