Jian Hu, Zixu Cheng, Chenyang Si, Wei Li, Shaogang Gong
✨ Highlights:
(i) We are the first to approach long video understanding by optimising the input video information so as to fully utilise the model's ability to comprehend long videos.
(ii) We propose a training-free mosaicing binary coding scheme, combined with pseudo temporal grounding, for long video understanding (see the sketch below).
(iii) We apply CoS to three different baselines to demonstrate its effectiveness and adaptability.
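To make the mosaicing binary coding idea concrete, here is a minimal, illustrative sketch: sampled frames are tiled into mosaic images, and the MLLM is asked to binary-code each mosaic as relevant (1) or irrelevant (0) to the question. The `mllm` callable, the 2x2 grid size, and the prompt wording are assumptions for illustration, not the exact implementation in this repository.

```python
# A minimal sketch of the mosaicing binary coding step, assuming PIL frames;
# `mllm` is a hypothetical callable wrapping the baseline model.
from PIL import Image

def mosaic(frames, grid=2, tile=224):
    """Compose up to grid*grid frames into one mosaic image."""
    canvas = Image.new("RGB", (grid * tile, grid * tile))
    for i, frame in enumerate(frames[: grid * grid]):
        x, y = (i % grid) * tile, (i // grid) * tile
        canvas.paste(frame.resize((tile, tile)), (x, y))
    return canvas

def binary_code_shots(mllm, frames, question, grid=2):
    """Binary-code each mosaic of shots as task-relevant (1) or not (0)."""
    codes = []
    for start in range(0, len(frames), grid * grid):
        image = mosaic(frames[start : start + grid * grid], grid)
        # Prompt wording is a placeholder, not the repo's exact prompt.
        answer = mllm(image, f"Is this mosaic relevant to: {question}? Answer 1 or 0.")
        codes.append(1 if "1" in answer else 0)
    return codes
```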
```bash
# Create and activate the environment
conda create -n CoS python=3.10 -y && conda activate CoS
pip install torch==2.1.2 torchvision --index-url https://download.pytorch.org/whl/cu118
pip install packaging && pip install ninja && pip install flash-attn --no-build-isolation --no-cache-dir
pip install -r requirements.txt

# Install the LongVA baseline
cd LongVA/
python -m pip install -e "longva/.[train]"
pip install transformers==4.46.3
pip install -q bitsandbytes==0.42.0 accelerate==0.26.0

# Install lmms-eval for evaluation
cd lmms-eval
pip install -e .
```
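Optionally, you can sanity-check that the pinned packages resolved as expected; the version numbers in the comments simply mirror the pip commands above.

```python
# Optional sanity check for the environment created above.
import torch, transformers, bitsandbytes, accelerate

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)  # expected 4.46.3
print("bitsandbytes:", bitsandbytes.__version__)  # expected 0.42.0
print("accelerate:", accelerate.__version__)      # expected 0.26.0
```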
For Video-MME, LongVideoBench, and MLVU evaluation, please use lmms-eval.
After installing lmms-eval and CoS, you can use the following script to evaluate. Note that the current baseline is LongVA; you can extend CoS to any baseline by modifying the code in the lmms-eval folder (a hedged sketch follows the script below).
```bash
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
    --model longva_cos \
    --model_args pretrained=lmms-lab/LongVA-7B,conv_template=qwen_1_5,model_name=llava_qwen,max_frames_num=128,video_decode_backend=decord \
    --tasks videomme \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix videoxl \
    --output_path ./logs/
```
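As a hedged sketch of what extending CoS to a new baseline involves: lmms-eval typically registers models in `lmms_eval/models/__init__.py` through an `AVAILABLE_MODELS` dict that maps a `--model` name to its model class. The entry and class names below for a new baseline are hypothetical placeholders, not this repo's exact API.

```python
# Hypothetical sketch of lmms_eval/models/__init__.py; the class-name strings
# are placeholders for the actual classes defined in the models folder.
AVAILABLE_MODELS = {
    "longva_cos": "LongVA_CoS",         # the CoS-wrapped LongVA used above
    "your_model_cos": "YourModel_CoS",  # hypothetical: a new baseline + CoS
}
```

With such an entry (and a corresponding model class that wraps your baseline with the CoS shot-selection step), the same `accelerate launch` command can be reused with `--model your_model_cos`.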
If you find this repository useful, please consider giving it a star ⭐ and a citation:
```bibtex
@article{hu2025cos,
  title={CoS: Chain-of-Shot Prompting for Long Video Understanding},
  author={Hu, Jian and Cheng, Zixu and Si, Chenyang and Li, Wei and Gong, Shaogang},
  journal={arXiv preprint arXiv:2502.06428},
  year={2025}
}
```
- LongVA: the codebase we built upon.
- LMMs-Eval: the codebase we build our CoS evaluation upon.
- Special thanks to Shu Yan for his generous and selfless help.
This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of those original licenses. The content of this project itself is licensed under the Apache License 2.0.