- [2025-01] 🎉 Our arXiv paper TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding is released!
- [2024-12] 🔊 Our TinyLLaVA-Video-v1 repository has been established.
This is a framework of Small-scale Large Multimodal Models for video understanding based on TinyLLaVA_Factory.
- The model has no more than 4B parameters and processes video sequences in a simple manner, without the need for complex architectures, supporting both fps sampling and uniform frame sampling.
- We validate the effectiveness of this framework through experiments; the best model achieves performance comparable to certain existing 7B models on multiple video understanding benchmarks.
- Clone this repository and navigate to the folder:

```bash
git clone https://github.com/ZhangXJ199/TinyLLaVA-Video.git
cd TinyLLaVA-Video
```

- Create a conda environment, activate it, and install the packages:

```bash
conda create -n tinyllava_video python=3.10 -y
conda activate tinyllava_video
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
```

- Install additional packages:

```bash
pip install flash-attn --no-build-isolation
```
- Upgrade to the latest code base:

```bash
git pull
pip install -e .
```
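As an optional sanity check, you can confirm that the key dependencies import cleanly after installation; `tinyllava` is assumed here to be the package name installed by `pip install -e .`.

```bash
# optional: verify that the core dependencies are importable
python -c "import torch; print('torch', torch.__version__, 'CUDA available:', torch.cuda.is_available())"
python -c "import flash_attn; print('flash-attn OK')"
python -c "import tinyllava; print('tinyllava OK')"  # assumed package name from `pip install -e .`
```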
Our training data combines subsets of two datasets: LLaVA-Video-178K and Valley.
Stage | Source | #Sample |
---|---|---|
Pretrain | LLaVA-Video-178K + Valley | 397k |
Finetune | LLaVA-Video-178K | 491k |
For pretraining, we use four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`, supplemented with the filtered Valley data. The organized pretraining annotations can be downloaded from here.
For finetuning, we use the same four subsets of LLaVA-Video-178K: `0_30_s_academic_v0_1`, `30_60_s_academic_v0_1`, `0_30_s_youtube_v0_1`, and `30_60_s_youtube_v0_1`. The organized finetuning annotations can be downloaded from here.
Organize the video files and annotation files as follows in `path/to/your/dataset`:

```
dataset
├── academic_source
├── liwei_youtube_videos
├── valley
├── text_files
│   ├── cleaned_video_caption.json
│   ├── cleaned_video_openqa.json
```
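Before training, a quick sanity check of the layout can catch missing files early; the sketch below only looks for the directories and annotation files listed above, and the dataset path is a placeholder.

```bash
DATA_ROOT=path/to/your/dataset   # placeholder: replace with your dataset root
for d in academic_source liwei_youtube_videos valley text_files; do
  [ -d "$DATA_ROOT/$d" ] || echo "missing directory: $DATA_ROOT/$d"
done
for f in cleaned_video_caption.json cleaned_video_openqa.json; do
  [ -f "$DATA_ROOT/text_files/$f" ] || echo "missing annotation file: $DATA_ROOT/text_files/$f"
done
```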
You can refer to TinyLLaVA_Factory to modify components such as `llm`, `vision_tower`, and `train_recipe`. Here is an example of training an LMM using Qwen2.5.
- Replace the data paths with yours in `scripts/train/qwen2/train_qwen2_base_video.sh`.
- Replace `output_dir` with yours in `scripts/train/qwen2/pretrain_qwen2_video.sh`.
- Replace `pretrained_model_path` and `output_dir` with yours in `scripts/train/qwen2/finetune_qwen2_video.sh`.
- Adjust your GPU ids (localhost) and `per_device_train_batch_size` in `scripts/train/qwen2/pretrain_qwen2_video.sh` and `scripts/train/qwen2/finetune_qwen2_video.sh` (an example sketch of these settings follows the launch command below).
```bash
bash scripts/train/qwen2/train_qwen2_base_video.sh
```
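For reference, a minimal sketch of what the edited settings in `pretrain_qwen2_video.sh` might look like; only the names mentioned in the steps above are assumed to appear in the scripts, and every value is a placeholder for your own environment.

```bash
# hypothetical edits inside pretrain_qwen2_video.sh -- values are placeholders
output_dir=path/to/your/checkpoints/pretrain   # where pretraining checkpoints are written
per_device_train_batch_size=4                  # lower this if you run out of GPU memory
# GPU ids are chosen in the launcher line, e.g. a DeepSpeed-style `--include localhost:0,1,2,3`
```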
Important hyperparameters used in pretraining and finetuning are provided below.
Training Stage | Global Batch Size | Learning rate | conv_version |
---|---|---|---|
Pretraining | 128 | 1e-4 | pretrain |
Finetuning | 64 | 2e-5 | qwen2_base |
Tips:

Global Batch Size = number of GPUs × `per_device_train_batch_size` × `gradient_accumulation_steps`. We recommend you always keep the global batch size and learning rate as above, except when LoRA-tuning your model.
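For example, one split that reaches the pretraining global batch size of 128 is 8 GPUs × a `per_device_train_batch_size` of 4 × a `gradient_accumulation_steps` of 4 = 128; how you divide the last two factors depends on your GPU memory.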
We currently provide evaluations on four benchmarks: Video-MME, MVBench, LongVideoBench, and MLVU.
- Download Video-MME and put it under `path/to/your/dataset/eval/Video-MME`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, `conv-mode`, and `duration` in `scripts/eval/videomme.sh`. There are three types of `duration` available for testing: `short`, `medium`, and `long` (an example sketch of these settings follows the command below).
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/videomme.sh
```
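As a rough illustration, the settings inside `scripts/eval/videomme.sh` might look like this; every value is a placeholder, and only the variable names from the step above are assumed to exist in the script.

```bash
# hypothetical values -- replace with your own checkpoint, run label, and dataset root
MODEL_PATH=path/to/your/checkpoints/finetune    # the TinyLLaVA-Video checkpoint to evaluate
MODEL_NAME=tinyllava-video-qwen2.5-3b           # a label for this evaluation run (placeholder)
EVAL_DIR=path/to/your/dataset/eval/Video-MME    # where you placed the benchmark
# conv-mode and duration are set in the same script, e.g. conv-mode=qwen2_base, duration=short
```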
- Download MVBench and put it under `path/to/your/dataset/eval/MVBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/mvbench.sh`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mvbench.sh
```
- Download LongVideoBench and put it under `path/to/your/dataset/eval/LongVideoBench`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/lvbench.sh`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/lvbench.sh
```
- Download MLVU and put it under `path/to/your/dataset/eval/MLVU`.
- Please change `MODEL_PATH`, `MODEL_NAME`, `EVAL_DIR`, and `conv-mode` in `scripts/eval/mlvu.sh`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 bash scripts/eval/mlvu.sh
```
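If you want to run several of the benchmarks back to back on one GPU, a simple loop over the evaluation scripts above works; this is only a convenience sketch, and each script still needs its variables edited first.

```bash
# run the MVBench, LongVideoBench, and MLVU scripts sequentially on GPU 0
for script in mvbench lvbench mlvu; do
  CUDA_VISIBLE_DEVICES=0 bash "scripts/eval/${script}.sh"
done
```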
In the #Frame/Query column below, 16 means sampling 16 frames, and 512 means using 512 tokens (queries) to represent the video sequence.
VT (HF Path) | LLM (HF Path) | #Frame/Query | Video-MME | MVBench | LongVideoBench | MLVU |
---|---|---|---|---|---|---|
google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 16/512 | 44.7 | 42.5 | 37.6 | 48.1 |
google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-3B | 1fps/1024 | 44.6 | 40.4 | 35.3 | 45.9 |
google/siglip-so400m-patch14-384 | microsoft/phi-2 | 16/512 | 42.7 | 42.0 | 42.2 | 46.5 |
google/siglip-so400m-patch14-384 | Qwen/Qwen2.5-1.5B | 16/512 | 34.4 | 39.0 | 29.5 | 40.5 |
- Please change `model_path`, `prompt`, `video_file`, and `conv-mode` in `eval.py`.
- Please use the following command for single-GPU inference:

```bash
CUDA_VISIBLE_DEVICES=0 python eval.py
```
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝.
```bibtex
@article{zhang2025tinyllava,
  title={TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding},
  author={Zhang, Xingjian and Weng, Xi and Yue, Yihao and Fan, Zhaoxin and Wu, Wenjun and Huang, Lei},
  journal={arXiv preprint arXiv:2501.15513},
  year={2025}
}
```

```bibtex
@article{jia2024tinyllava,
  title={TinyLLaVA Factory: A Modularized Codebase for Small-scale Large Multimodal Models},
  author={Jia, Junlong and Hu, Ying and Weng, Xi and Shi, Yiming and Li, Miao and Zhang, Xingjian and Zhou, Baichuan and Liu, Ziyu and Luo, Jie and Huang, Lei and Wu, Ji},
  journal={arXiv preprint arXiv:2405.11788},
  year={2024}
}
```
- This repository is based on the TinyLLaVA_Factory project.
- Our codebase is built upon the LLaVA project. Great work!