[Feature] Support CLIP4Clip #2489
Conversation
Codecov Report
Patch coverage:
Additional details and impacted files

@@           Coverage Diff            @@
## dev-1.x #2489 +/- ##
===========================================
+ Coverage 77.23% 77.62% +0.38%
===========================================
Files 161 167 +6
Lines 13172 13440 +268
Branches 2266 2302 +36
===========================================
+ Hits 10174 10433 +259
- Misses 2449 2455 +6
- Partials 549 552 +3
CLIP4Clip
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Abstract
Video-text retrieval plays an essential role in multi-modal research and has been widely used in many real-world web applications. CLIP (Contrastive Language-Image Pre-training), an image-language pre-training model, has demonstrated the power of learning visual concepts from web-collected image-text datasets. In this paper, we propose the CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner. Several questions are investigated via empirical studies: 1) Is the image feature enough for video-text retrieval? 2) How does post-pretraining on a large-scale video-text dataset affect the performance of the CLIP-based model? 3) What is the practical mechanism to model temporal dependency between video frames? 4) How sensitive is the model to hyper-parameters on the video-text retrieval task? Extensive experimental results show that the CLIP4Clip model transferred from CLIP can achieve SOTA results on various video-text retrieval datasets, including MSR-VTT, MSVD, LSMDC, ActivityNet, and DiDeMo.
Results and Models
MSRVTT-9k
For more details on data preparation, you can refer to video_retrieval.
Train
You can use the following command to train a model.
python tools/train.py ${CONFIG_FILE} [optional arguments]
Example: train the CLIP4Clip model on the MSRVTT-9k dataset with deterministic training and periodic validation, as sketched below.
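A minimal sketch of such a command, assuming the MMAction2 1.x `tools/train.py` interface; the config path below is a placeholder and should be replaced with the actual CLIP4Clip MSRVTT-9k config added in this PR.

```shell
# Placeholder config path; substitute the CLIP4Clip MSRVTT-9k config from this PR.
python tools/train.py configs/retrieval/clip4clip/clip4clip_msrvtt-9k.py \
    --seed=0 \
    --deterministic
```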
For more details, you can refer to the Training part in the Training and Test Tutorial.
Test
You can use the following command to test a model.
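The generic form, following the MMAction2 tools layout (the bracketed part stands for whatever optional flags apply):

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```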
Example: test the CLIP4Clip model on the MSRVTT-9k dataset and dump the result to a pkl file, as sketched below.
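A sketch under the same assumptions: the config and checkpoint paths are placeholders, and `--dump` is the MMAction2 1.x option for saving raw results to a pickle file (verify against the version in this PR).

```shell
# Placeholder config/checkpoint paths; substitute the actual files.
python tools/test.py configs/retrieval/clip4clip/clip4clip_msrvtt-9k.py \
    checkpoints/CHECKPOINT.pth \
    --dump result.pkl
```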
For more details, you can refer to the Test part in the Training and Test Tutorial.
Citation
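A BibTeX entry for the paper; the fields below follow the arXiv preprint (arXiv:2104.08860) and should be checked against the published version.

```BibTeX
@article{luo2021clip4clip,
  title={CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval},
  author={Luo, Huaishao and Ji, Lei and Zhong, Ming and Chen, Yang and Lei, Wen and Duan, Nan and Li, Tianrui},
  journal={arXiv preprint arXiv:2104.08860},
  year={2021}
}
```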