
[Feature] Support CLIP4Clip #2489

Merged: 56 commits merged into open-mmlab:dev-1.x on Jun 19, 2023
Conversation

@Dai-Wenxun (Collaborator) commented May 22, 2023

CLIP4Clip

CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval

Abstract

Video-text retrieval plays an essential role in multi-modal research and is widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Is the image feature enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset based on CLIP affect the performance? 3) What is the practical mechanism to model temporal dependency between video frames? 4) What is the hyper-parameter sensitivity of the model on the video-text retrieval task? Extensive experimental results show that the CLIP4Clip model transferred from CLIP achieves SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
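The "Mean" adapter in the results table below is the paper's parameter-free mean-pooling similarity: per-frame CLIP image embeddings are averaged into a single video embedding, which is compared to the text embedding by cosine similarity. A minimal PyTorch sketch of that computation (names and shapes are illustrative, not mmaction's actual API):

```python
import torch
import torch.nn.functional as F

def mean_pooling_similarity(frame_feats: torch.Tensor,
                            text_feats: torch.Tensor) -> torch.Tensor:
    """Parameter-free 'mean pooling' video-text similarity.

    frame_feats: (num_videos, num_frames, dim) per-frame CLIP image embeddings.
    text_feats:  (num_texts, dim) CLIP text embeddings.
    Returns a (num_texts, num_videos) cosine-similarity matrix.
    """
    # Average the frame embeddings over time: one embedding per video.
    video_feats = frame_feats.mean(dim=1)
    # L2-normalize both sides so the dot product equals cosine similarity.
    video_feats = F.normalize(video_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return text_feats @ video_feats.t()
```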

Results and Models

MSRVTT-9k

| frame sampling strategy | resolution | gpus | backbone | adapter | pretrain | Recall@1 | Recall@5 | Recall@10 | MdR | MnR | testing protocol | config | ckpt | log |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| uniform 12 | 224x224 | 8 | ViT-B/32 | Mean | clip | 43.1 | 69.4 | 78.9 | 2.0 | 16.8 | 1 clip x 1 crop | config | ckpt | log |
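Recall@K is the percentage of text queries whose ground-truth video ranks in the top K retrieved results; MdR and MnR are the median and mean rank of the ground-truth video (lower is better). A minimal NumPy sketch of these metrics, assuming a square text-to-video similarity matrix in which query i matches video i (the actual implementation lives in mmaction/evaluation/metrics/retrieval_metric.py):

```python
import numpy as np

def retrieval_metrics(sim: np.ndarray) -> dict:
    """Recall@K, median rank (MdR) and mean rank (MnR) for retrieval.

    sim: (num_queries, num_videos) similarity matrix; query i is
    assumed to match video i (diagonal ground truth).
    """
    # Sort candidates for each query from most to least similar.
    order = np.argsort(-sim, axis=1)
    # 1-based rank of the ground-truth video for each query.
    ranks = np.argmax(order == np.arange(sim.shape[0])[:, None], axis=1) + 1
    metrics = {f'Recall@{k}': 100.0 * float(np.mean(ranks <= k))
               for k in (1, 5, 10)}
    metrics['MdR'] = float(np.median(ranks))
    metrics['MnR'] = float(np.mean(ranks))
    return metrics
```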

For more details on data preparation, you can refer to video_retrieval.

Train

You can use the following command to train a model.

python tools/train.py ${CONFIG_FILE} [optional arguments]

Example: train the CLIP4Clip model on the MSRVTT-9k dataset deterministically, with periodic validation.

python tools/train.py configs/retrieval/clip4clip/clip4clip_vit-base-p32-res224-clip-pre_8xb16-u12-5e_msrvtt-9k-rgb.py \
    --seed 0 --deterministic
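
The config name appears to encode the recipe behind the result above (8 GPUs x batch 16, 12 uniformly sampled frames, 5 epochs). To reproduce the 8-GPU setting, a distributed launch via torchrun should work, assuming the mmengine-style --launcher flag exposed by mmaction2's tools:

torchrun --nproc_per_node=8 tools/train.py configs/retrieval/clip4clip/clip4clip_vit-base-p32-res224-clip-pre_8xb16-u12-5e_msrvtt-9k-rgb.py \
    --launcher pytorch --seed 0 --deterministic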

For more details, you can refer to the Training part in the Training and Test Tutorial.

Test

You can use the following command to test a model.

python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]

Example: test the CLIP4Clip model on the MSRVTT-9k dataset and dump the results to a pkl file.

python tools/test.py configs/retrieval/clip4clip/clip4clip_vit-base-p32-res224-clip-pre_8xb16-u12-5e_msrvtt-9k-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
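
The --dump option serializes the raw test outputs with pickle. A quick way to inspect the dumped file (the exact structure of each entry depends on the model and version, so treat the indexing below as an assumption):

```python
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

# Typically a list with one entry per test sample; print one to see its fields.
print(type(results), len(results))
print(results[0])
```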

For more details, you can refer to the Test part in the Training and Test Tutorial.

Citation

@article{luo2022clip4clip,
  title={CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning},
  author={Luo, Huaishao and Ji, Lei and Zhong, Ming and Chen, Yang and Lei, Wen and Duan, Nan and Li, Tianrui},
  journal={Neurocomputing},
  volume={508},
  pages={293--304},
  year={2022},
}

@Dai-Wenxun Dai-Wenxun added the WIP work in progress label May 22, 2023
@cir7 cir7 closed this Jun 13, 2023
@cir7 cir7 reopened this Jun 13, 2023
@cir7 cir7 self-requested a review June 13, 2023 09:54
@Dai-Wenxun Dai-Wenxun removed the WIP work in progress label Jun 13, 2023
@codecov
Copy link

codecov bot commented Jun 15, 2023

Codecov Report

Patch coverage: 94.09%; project coverage change: +0.38% 🎉

Comparison is base (1dc3a9a) 77.23% compared to head (8df84ef) 77.62%.

❗ Current head 8df84ef differs from the pull request's most recent head e6992c6. Consider uploading reports for commit e6992c6 to get more accurate results.

Additional details and impacted files
@@             Coverage Diff             @@
##           dev-1.x    #2489      +/-   ##
===========================================
+ Coverage    77.23%   77.62%   +0.38%     
===========================================
  Files          161      167       +6     
  Lines        13172    13440     +268     
  Branches      2266     2302      +36     
===========================================
+ Hits         10174    10433     +259     
- Misses        2449     2455       +6     
- Partials       549      552       +3     
| Flag | Coverage Δ |
| :--- | :--- |
| unittests | 77.62% <94.09%> (+0.38%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| :--- | :--- |
| `mmaction/testing/__init__.py` | 100.00% <ø> (ø) |
| `mmaction/structures/action_data_sample.py` | 72.28% <83.33%> (+0.86%) ⬆️ |
| `mmaction/datasets/transforms/text_transforms.py` | 85.71% <85.71%> (ø) |
| `mmaction/models/similarity/clip_similarity.py` | 87.09% <87.09%> (ø) |
| `mmaction/evaluation/metrics/retrieval_metric.py` | 97.95% <97.95%> (ø) |
| `mmaction/datasets/__init__.py` | 100.00% <100.00%> (ø) |
| `mmaction/datasets/transforms/__init__.py` | 100.00% <100.00%> (ø) |
| `mmaction/datasets/transforms/formatting.py` | 95.08% <100.00%> (+3.27%) ⬆️ |
| `mmaction/datasets/transforms/loading.py` | 81.31% <100.00%> (ø) |
| `mmaction/datasets/video_text_dataset.py` | 100.00% <100.00%> (ø) |

... and 6 more


@cir7 cir7 merged commit 274c2ad into open-mmlab:dev-1.x Jun 19, 2023