Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

This is the official implementation of BLiM (ICCV 2025 Highlight).

Dohwan Ko¹*, Ji Soo Lee¹*, Minhyuk Choi¹, Zihang Meng², Hyunwoo J. Kim³

¹Korea University   ²Meta GenAI   ³KAIST

arXiv | Dataset | Project Page

Setup

To install requirements, run:

git clone https://github.com/mlvlab/BLiM.git
cd BLiM
conda create -n blim python=3.11
conda activate blim
bash setup.sh
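
As a quick sanity check, you can confirm that the environment sees your GPUs before moving on. This assumes setup.sh installs a CUDA-enabled build of PyTorch, which is not stated explicitly above:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"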

Dataset

Quick start

  • You can download the preprocessed video features and annotations for DiDeMo, ActivityNet, LSMDC, and MSRVTT here.

Long start

  • After downloading the videos for each dataset, use extract.py to extract video features.

    CUDA_VISIBLE_DEVICES=0 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 0
    
    CUDA_VISIBLE_DEVICES=1 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 1
    
    CUDA_VISIBLE_DEVICES=2 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 2
    
    CUDA_VISIBLE_DEVICES=3 python extract.py --dataset DiDeMo --batch_size 16 --num_chunk 4 --chunk_idx 3
    
  • This will launch feature extraction on 4 GPUs in parallel. To use a different number of GPUs, adjust the --num_chunk and --chunk_idx arguments accordingly (see the launcher sketch below).

  • You can also use extract.py to extract video features from your own dataset. Simply modify the --dataset argument and ensure the videos are organized in the expected format.
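
  • As noted above, the per-GPU commands can also be generated with a small shell loop, one extraction process per GPU. This is a sketch rather than part of the official scripts; it assumes GPU indices 0..NUM_GPUS-1 are all visible:

    # Launch one extract.py chunk per GPU, in parallel.
    NUM_GPUS=4
    for i in $(seq 0 $((NUM_GPUS - 1))); do
      CUDA_VISIBLE_DEVICES=$i python extract.py --dataset DiDeMo \
        --batch_size 16 --num_chunk $NUM_GPUS --chunk_idx $i &
    done
    wait  # block until every chunk has finished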


  • Directory structure: After preparing the video features and annotations, place them under the ./data/ directory as shown below:

    ./data
       |─ DiDeMo
       |   |─ features
       |   |   └─ ...
       |   |─ videos   # (optional) only for long start
       |   |   └─ ...
       |   |─ didemo_ret_train.json
       |   └─ didemo_ret_test.json
       |─ ActivityNet
       |   |─ features
       |   |   └─ ...
       |   |─ videos   # (optional) only for long start
       |   |   └─ ...
       |   |─ anet_ret_train.json
       |   └─ anet_ret_val_1.json
       |─ LSMDC
       |   :  (same structure as above)
       └─ MSRVTT
           :  (same structure as above)

InternVideo2 Score Preparation

  • We provide pre-extracted retrieval scores from InternVideo2 1B here.

  • After downloading, place the scores folder at the root of your project directory.
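
  • Assuming the folder keeps its name scores, the resulting layout would look as follows (a sketch; the file names inside scores are not documented here):

    ./BLiM
       |─ scores
       |   └─ ...
       |─ data
       |   └─ ...
       └─ ...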

VideoChat-Flash Preparation

mkdir pretrained
cd pretrained
git lfs install
git clone https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
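
If git-lfs is misconfigured, the clone can silently contain pointer files instead of the actual weights. A quick check with standard git-lfs commands (nothing repo-specific) is:

git -C VideoChat-Flash-Qwen2-7B_res448 lfs ls-files
cd ..  # return to the project root before training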

Training BLiM

DiDeMo

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
main.py --batch_size 4 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset DiDeMo --topk 16 \
--lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/didemo/blim --accum_iter 1 --cpn --alpha 0.0 0.8 --c 0.9 0.2 0.9 0.9

ActivityNet

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
main.py --batch_size 2 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset ActivityNet --topk 16 \
--lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/activitynet/blim --accum_iter 2 --cpn --alpha 0.2 0.9 --c 1.0 0.4 0.9 0.8

LSMDC

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
main.py --batch_size 4 --batch_size_eval 16 --epochs 3 --warmup_epochs 1 --dataset LSMDC --topk 16 \
--lr 1e-4 --weight_decay 1e-0 --output_dir ./checkpoint/lsmdc/blim --accum_iter 8 --cpn --alpha 0.2 1.0 --c 1.0 0.6 0.9 0.6

MSRVTT

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
main.py --batch_size 4 --batch_size_eval 16 --epochs 3 --warmup_epochs 1 --dataset MSRVTT --topk 16 \
--lr 1e-4 --weight_decay 1e-0 --output_dir ./checkpoint/msrvtt/blim --accum_iter 16 --cpn --alpha 0.0 0.9 --c 1.0 0.6 0.8 0.4
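
The commands above assume 8 GPUs (--nproc_per_node 8). One unverified but common adaptation for fewer devices is to scale --accum_iter so the effective batch size (batch_size × num_gpus × accum_iter) stays the same, e.g. DiDeMo on 2 GPUs:

torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 2 \
main.py --batch_size 4 --batch_size_eval 16 --epochs 5 --warmup_epochs 1 --dataset DiDeMo --topk 16 \
--lr 2e-4 --weight_decay 1e-0 --output_dir ./checkpoint/didemo/blim --accum_iter 4 --cpn --alpha 0.0 0.8 --c 0.9 0.2 0.9 0.9
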
  • You can download our fine-tuned checkpoints here. After downloading, place the checkpoint folder in your working directory.

Evaluation

  • To evaluate a fine-tuned model, run the training script with the following arguments:

    --eval --resume ./your/checkpoint.pth
  • For zero-shot evaluation, add --eval on its own, without --resume. We recommend adjusting the --alpha and --c values, which control the weights for CPN and the ensemble with InternVideo2, respectively (an example command follows the settings below).

    # DiDeMo
    --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.9
    
    # ActivityNet
    --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.8
    
    # LSMDC
    --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.8
    
    # MSRVTT
    --alpha 0.0 0.8 --c 1.0 0.0 0.8 0.6
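
  • Putting the pieces together, a zero-shot DiDeMo run could look like the command below. This is assembled from the training script and the settings above, so treat it as a sketch rather than a verified recipe; for fine-tuned evaluation, append --resume ./your/checkpoint.pth:

    torchrun --rdzv_endpoint 127.0.0.1:1234 --nproc_per_node 8 \
    main.py --batch_size_eval 16 --dataset DiDeMo --topk 16 \
    --eval --cpn --alpha 0.0 0.9 --c 1.0 0.0 0.9 0.9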

Acknowledgements

This repo is built upon Flipped-VQA.

Citations

@inproceedings{ko2025bidirectional,
  title={Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval},
  author={Ko, Dohwan and Lee, Ji Soo and Choi, Minhyuk and Meng, Zihang and Kim, Hyunwoo J},
  booktitle={ICCV},
  year={2025}
}
