This repository contains the official code for the BMVC 2023 (Oral) paper Video-adverb retrieval with compositional adverb-action embeddings.
Install all required dependencies into a new virtual environment via conda.
conda env create -f environment.yaml
We provide pre-extracted S3D features for all datasets. We thank the authors of Action Changes for sharing them with us.
You can download the features of all datasets here:
By default, the features should be placed inside a features folder:
mkdir features
unzip video-adverb-datasets_s3d.zip -d features/
In case you store the features in a different location, you can set the path during execution:
python train.py feature_dir=PATH_TO_FEATURES [...]
To download videos and to extract S3D features of the datasets on your own, you can follow the instructions given here: https://github.com/dmoltisanti/air-cvpr23
Pseudo Adverbs proposed a VATEX Adverbs dataset split for evaluating unseen adverb-action compositions. We replicate this split for the S3D features used in this work by omitting unavailable videos. In addition, we propose new splits for evaluating performance on unseen compositions for the ActivityNet Adverbs and MSR-VTT Adverbs datasets. The statistics for the dataset splits are given below:
| | VATEX Adv. | ActivityNet Adv. | MSR-VTT Adv. |
|---|---|---|---|
| # train samples | 6603 | 1490 | 987 |
| # unlabelled samples | 3317 | 634 | 306 |
| # test samples | 3293 | 848 | 454 |
| # pairs train | 319 | 635 | 225 |
| # pairs unlabelled | 168 | 537 | 114 |
| # pairs test | 316 | 543 | 225 |
You can find the corresponding files for the unseen compositions splits at splits/unseen_compositions/DATASET_NAME. Each folder contains the following files:

- antonyms.csv: mapping file for mapping adverbs to their respective antonym
- train.csv: list of samples / videos used for training
- test.csv: list of samples / videos used for testing
- unlabelled.csv: list of samples / videos that can be used as unlabelled data (pseudo-labelling)
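As a quick sketch of how the antonym file might be consumed (the two-column adverb/antonym layout shown here is an assumption; check the actual files in splits/unseen_compositions/), an adverb-to-antonym lookup could be built like this:

```python
import csv
import io

# Hypothetical file contents -- the real antonyms.csv may use different
# column names or ordering; adapt the field names accordingly.
SAMPLE = """adverb,antonym
quickly,slowly
slowly,quickly
gently,firmly
firmly,gently
"""

def load_antonyms(fh):
    """Build a dict mapping each adverb to its antonym from a CSV file handle."""
    reader = csv.DictReader(fh)
    return {row["adverb"]: row["antonym"] for row in reader}

antonyms = load_antonyms(io.StringIO(SAMPLE))
print(antonyms["quickly"])  # -> slowly
```

With a real file, pass `open("splits/unseen_compositions/DATASET_NAME/antonyms.csv")` instead of the in-memory sample.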
We additionally provide model checkpoints, which can be downloaded here.
In the following, we provide commands for training our method ReGaDa for video-adverb retrieval. We provide commands for both the main experiments and for the experiment on unseen adverb-action compositions.
python train.py +run=main_howto100m # HowTo100M Adverbs
python train.py +run=main_air # Adverbs in Recipes
python train.py +run=main_activitynet # ActivityNet Adverbs
python train.py +run=main_msrvtt # MSR-VTT Adverbs
python train.py +run=main_vatex # VATEX Adverbs
python train.py +run=unseen_vatex # VATEX Adverbs (unseen compositions)
python train.py +run=unseen_activitynet # ActivityNet Adverbs (unseen compositions)
python train.py +run=unseen_msrvtt # MSR-VTT Adverbs (unseen compositions)
You can evaluate your trained model, or one of the provided model checkpoints, using the following command:
python test.py checkpoint=CHECKPOINT_PATH
where CHECKPOINT_PATH is the checkpoint directory of the model to evaluate.
Model | mAP W | mAP M | Acc-A |
---|---|---|---|
Priors | 0.446 | 0.354 | 0.786 |
S3D pre-trained | 0.339 | 0.238 | 0.560 |
TIRG | 0.441 | 0.476 | 0.721 |
Action Modifier | 0.406 | 0.372 | 0.796 |
ACCLS | 0.562 | 0.420 | 0.786 |
ACREG | 0.555 | 0.423 | 0.799 |
ReGaDa (ours) | 0.567 | 0.528 | 0.817 |
Model | mAP W | mAP M | Acc-A |
---|---|---|---|
Priors | 0.491 | 0.263 | 0.854 |
S3D pre-trained | 0.389 | 0.173 | 0.735 |
TIRG | 0.485 | 0.228 | 0.835 |
Action Modifier | 0.509 | 0.251 | 0.857 |
ACCLS | 0.606 | 0.289 | 0.841 |
ACREG | 0.613 | 0.244 | 0.847 |
ReGaDa (ours) | 0.704 | 0.418 | 0.874 |
Model | mAP W | mAP M | Acc-A |
---|---|---|---|
Priors | 0.217 | 0.159 | 0.745 |
S3D pre-trained | 0.118 | 0.070 | 0.560 |
TIRG | 0.186 | 0.111 | 0.709 |
Action Modifier | 0.184 | 0.125 | 0.753 |
ACCLS | 0.130 | 0.096 | 0.741 |
ACREG | 0.119 | 0.079 | 0.714 |
ReGaDa (ours) | 0.239 | 0.175 | 0.771 |
Model | mAP W | mAP M | Acc-A |
---|---|---|---|
Priors | 0.308 | 0.152 | 0.723 |
S3D pre-trained | 0.194 | 0.075 | 0.603 |
TIRG | 0.297 | 0.113 | 0.700 |
Action Modifier | 0.233 | 0.127 | 0.731 |
ACCLS | 0.305 | 0.131 | 0.751 |
ACREG | 0.282 | 0.114 | 0.774 |
ReGaDa (ours) | 0.378 | 0.228 | 0.786 |
Model | mAP W | mAP M | Acc-A |
---|---|---|---|
Priors | 0.216 | 0.086 | 0.752 |
S3D pre-trained | 0.122 | 0.038 | 0.586 |
TIRG | 0.195 | 0.065 | 0.735 |
Action Modifier | 0.139 | 0.059 | 0.751 |
ACCLS | 0.283 | 0.108 | 0.754 |
ACREG | 0.261 | 0.086 | 0.755 |
ReGaDa (ours) | 0.290 | 0.113 | 0.817 |
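The metric abbreviations in the tables above are not defined in this README; in the video-adverb retrieval literature, mAP W and mAP M typically denote mean average precision with weighted and macro averaging over classes, and Acc-A denotes binary antonym accuracy. As a rough illustration (not the repository's evaluation code), the two averaging schemes differ only in how per-class APs are combined:

```python
import numpy as np

def average_precision(labels, scores):
    """AP for one class: mean precision at the rank of each retrieved positive."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    ranks = np.arange(1, len(labels) + 1)
    precisions = np.cumsum(labels)[labels == 1] / ranks[labels == 1]
    return float(precisions.mean())

def mean_ap(per_class_labels, per_class_scores, weighted=False):
    """Macro mAP averages class APs equally; weighted mAP weights by #positives."""
    aps = np.array([average_precision(l, s)
                    for l, s in zip(per_class_labels, per_class_scores)])
    if not weighted:
        return float(aps.mean())
    support = np.array([sum(l) for l in per_class_labels])
    return float((aps * support).sum() / support.sum())

# Toy example: two classes with different numbers of positives.
labels = [[1, 0, 1], [0, 1]]
scores = [[0.9, 0.8, 0.7], [0.6, 0.4]]
print(mean_ap(labels, scores))                 # macro,    ~0.667
print(mean_ap(labels, scores, weighted=True))  # weighted, ~0.722
```

In the toy example, the class with more positives has the higher AP, so the weighted mean is pulled above the macro mean.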
Model | VATEX Adv. | ActivityNet Adv. | MSR-VTT Adv. |
---|---|---|---|
CLIP | 54.5 | 55.1 | 57.0 |
Action Modifier | 53.8 | 57.0 | 56.0 |
ACCLS | 54.3 | 55.1 | 53.7 |
ACREG | 54.9 | 53.9 | 59.0 |
ReGaDa (ours) | 61.7 | 58.4 | 61.0 |
If you find this code useful, please consider citing:
@inproceedings{hummel2023regada,
  author    = {Hummel, Thomas and Mercea, Otniel-Bogdan and Koepke, A. Sophia and Akata, Zeynep},
  title     = {Video-adverb retrieval with compositional adverb-action embeddings},
  booktitle = {BMVC},
  year      = {2023}
}
Small portions of the code are adapted from the following repositories: