This is the official implementation of "Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval" (BLiM), accepted to ICCV 2025 as a Highlight.
Dohwan Ko<sup>1*</sup>, Ji Soo Lee<sup>1*</sup>, Minhyuk Choi<sup>1</sup>, Zihang Meng<sup>2</sup>, Hyunwoo J. Kim<sup>3</sup>

<sup>1</sup>Korea University, <sup>2</sup>Meta GenAI, <sup>3</sup>KAIST
To install requirements, run:

```bash
git clone https://github.com/mlvlab/BLiM.git
cd BLiM
conda create -n blim python=3.11
conda activate blim
bash setup.sh
```
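A quick sanity check after setup (a minimal sketch; it assumes `setup.sh` installs a CUDA-enabled PyTorch, which the `torchrun` and `CUDA_VISIBLE_DEVICES` commands below rely on):

```bash
# Verify that the environment resolves and that PyTorch sees your GPUs (assumed dependency).
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```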
- You can download the preprocessed video features and annotations for the following datasets: DiDeMo, ActivityNet, LSMDC, and MSRVTT here.
- After downloading the videos for each dataset, use `extract.py` to extract video features:

  ```bash
  CUDA_VISIBLE_DEVICES=0 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 0
  CUDA_VISIBLE_DEVICES=1 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 1
  CUDA_VISIBLE_DEVICES=2 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 2
  CUDA_VISIBLE_DEVICES=3 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 3
  ```
- This launches feature extraction on 4 GPUs in parallel. To use a different number of GPUs, adjust the `--num_chunk` and `--chunk_idx` arguments accordingly, as in the sketch below.
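  The following is a minimal sketch that generalizes the commands above to an arbitrary number of GPUs (it only reuses the flags shown above; the shell loop and `&`/`wait` backgrounding are our own convention, not part of the repository):

  ```bash
  # Launch one extraction chunk per GPU in parallel (sketch; flags as in the DiDeMo example above).
  NUM_GPUS=4
  for i in $(seq 0 $((NUM_GPUS - 1))); do
    CUDA_VISIBLE_DEVICES=$i python extract.py --dataset DiDeMo --batch_size 16 \
      --num_chunk $NUM_GPUS --chunk_idx $i &
  done
  wait  # block until every chunk has finished
  ```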
- You can also use `extract.py` to extract video features from your own dataset. Simply modify the `--dataset` argument and make sure the videos are organized in the expected format; see the sketch below.
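  For example (a sketch only; `MyDataset` is a hypothetical dataset name, and it assumes `extract.py` locates the videos from the directory layout shown in the next item):

  ```bash
  # Single-GPU extraction for a hypothetical custom dataset.
  CUDA_VISIBLE_DEVICES=0 python extract.py --dataset MyDataset --batch_size 16 --num_chunk 1 --chunk_idx 0
  ```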
- Directory structure: after preparing the video features and annotations, place them under the `./data/` directory as shown below:

  ```
  ./data
  └─ DiDeMo
     |─ features
     |  └─ ...
     |─ videos          # (optional) only for long start
     |  └─ ...
     |─ didemo_ret_train.json
     └─ didemo_ret_test.json
  └─ ActivityNet
     |─ features
     |  └─ ...
     |─ videos          # (optional) only for long start
     |  └─ ...
     |─ anet_ret_train.json
     └─ anet_ret_val_1.json
  └─ LSMDC
     :
  └─ MSRVTT
     :
  ```
- We provide pre-extracted retrieval scores from InternVideo2 1B here.
- After downloading, place the `scores` folder at the root of your project directory, as sketched below.
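  The resulting layout then looks roughly like this (a sketch; only the placement of `scores/` is stated above, and the other entries mirror paths used elsewhere in this README):

  ```
  ./BLiM
  |─ data
  |─ scores
  |─ extract.py
  └─ main.py
  ```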
Download the VideoChat-Flash-Qwen2-7B_res448 weights from Hugging Face into the `./pretrained/` directory:

```bash
mkdir pretrained
cd pretrained
git lfs install
git clone https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
```
To fine-tune BLiM on each dataset with 8 GPUs, run:

```bash
# DiDeMo
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
    main.py --batch_size 4 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset DiDeMo --topk 16 \
    --lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/didemo/blim --accum_iter 1 --cpn --alpha 0.0 0.8 --c 0.9 0.2 0.9 0.9

# ActivityNet
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
    main.py --batch_size 2 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset ActivityNet --topk 16 \
    --lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/activitynet/blim --accum_iter 2 --cpn --alpha 0.2 0.9 --c 1.0 0.4 0.9 0.8

# LSMDC
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
    main.py --batch_size 4 --batch_size_eval 16 --epochs 3 --warmup_epochs 1 --dataset LSMDC --topk 16 \
    --lr 1e-4 --weight_decay 1e-0 --output_dir ./checkpoint/lsmdc/blim --accum_iter 8 --cpn --alpha 0.2 1.0 --c 1.0 0.6 0.9 0.6

# MSRVTT
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
    main.py --batch_size 4 --batch_size_eval 16 --epochs 3 --warmup_epochs 1 --dataset MSRVTT --topk 16 \
    --lr 1e-4 --weight_decay 1e-0 --output_dir ./checkpoint/msrvtt/blim --accum_iter 16 --cpn --alpha 0.0 0.9 --c 1.0 0.6 0.8 0.4
```
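If you train on fewer GPUs, one option (a sketch based on our assumption that `--accum_iter` is gradient accumulation, not an official recipe) is to scale `--accum_iter` so that the effective batch size, `batch_size × num_GPUs × accum_iter`, stays the same:

```bash
# DiDeMo on 4 GPUs instead of 8: --accum_iter is doubled so the effective
# batch size stays at 4 x 8 x 1 = 4 x 4 x 2 = 32.
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 4 \
    main.py --batch_size 4 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset DiDeMo --topk 16 \
    --lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/didemo/blim --accum_iter 2 --cpn --alpha 0.0 0.8 --c 0.9 0.2 0.9 0.9
```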
- You can download our fine-tuned checkpoints here. After downloading, place the `checkpoint` folder in your working directory.
- To evaluate a fine-tuned model, run the training script with the following arguments (a full example follows):

  ```bash
  --eval --resume ./your/checkpoint.pth
  ```
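  For example, to evaluate on DiDeMo (a sketch; every flag other than `--eval --resume` is copied from the DiDeMo fine-tuning command above, and the checkpoint path is a placeholder):

  ```bash
  torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
      main.py --batch_size 4 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset DiDeMo --topk 16 \
      --lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/didemo/blim --accum_iter 1 --cpn --alpha 0.0 0.8 --c 0.9 0.2 0.9 0.9 \
      --eval --resume ./your/checkpoint.pth
  ```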
- For zero-shot evaluation, simply add `--eval`. We recommend adjusting the `--alpha` and `--c` values, which control the weights for CPN and the ensemble with InternVideo2, respectively (a full command sketch follows this list):

  ```bash
  # DiDeMo
  --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.9
  # ActivityNet
  --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.8
  # LSMDC
  --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.8
  # MSRVTT
  --alpha 0.0 0.8 --c 1.0 0.0 0.8 0.6
  ```
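A zero-shot run then looks like the following (a sketch; it reuses the DiDeMo fine-tuning command with the recommended zero-shot `--alpha`/`--c` values and adds `--eval` without `--resume`; the `--output_dir` path is a placeholder of our choosing):

```bash
torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
    main.py --batch_size 4 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset DiDeMo --topk 16 \
    --lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/didemo/zeroshot --accum_iter 1 --cpn \
    --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.9 --eval
```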
This repo is built upon Flipped-VQA.
```bibtex
@inproceedings{ko2025bidirectional,
  title={Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval},
  author={Ko, Dohwan and Lee, Ji Soo and Choi, Minhyuk and Meng, Zihang and Kim, Hyunwoo J},
  booktitle={ICCV},
  year={2025}
}
```