[Feature] support MViT #2007

Merged (8 commits) · Dec 1, 2022
14 changes: 14 additions & 0 deletions configs/_base_/models/mvit_small.py
@@ -0,0 +1,14 @@
model = dict(
type='Recognizer3D',
backbone=dict(type='MViT', arch='small', drop_path_rate=0.2),
data_preprocessor=dict(
type='ActionDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
format_shape='NCTHW'),
cls_head=dict(
type='MViTHead',
in_channels=768,
num_classes=400,
label_smooth_eps=0.1,
average_clips='prob'))
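
As a quick sanity check, the base config above can be built on its own with the mmengine/mmaction registries. The snippet below is a minimal sketch (not part of this PR) and assumes the mmaction2 1.x API, i.e. `register_all_modules` and `MODELS.build`:

```python
# Minimal sketch: build the MViTv2-small recognizer defined in
# configs/_base_/models/mvit_small.py and count its parameters.
# Assumes mmaction2 1.x (register_all_modules, MODELS registry).
from mmengine.config import Config

from mmaction.registry import MODELS
from mmaction.utils import register_all_modules

register_all_modules()  # register mmaction components in the default scope

cfg = Config.fromfile('configs/_base_/models/mvit_small.py')
model = MODELS.build(cfg.model)

n_params = sum(p.numel() for p in model.parameters())
print(f'Built {type(model).__name__} with {n_params / 1e6:.1f}M parameters')
```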
77 changes: 77 additions & 0 deletions configs/recognition/mvit/README.md
@@ -0,0 +1,77 @@
# MViT V2

> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video
classification, as well as object detection. We present an improved version of MViT that incorporates
decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture
in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where
it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where
it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art
performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as
well as 86.1% on Kinetics-400 video classification.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/33249023/196627033-03a4e9b1-082e-42ee-a2a0-77f874fe632a.png" width="50%"/>
</div>

## Results and models

### Kinetics-400

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :-----------------------------: | :-----------------------------: | :--------------: | :---: | :----: | :-----------------: | :---------------: |
| 16x4x1 | short-side 320 | MViTv2-S\* | From scratch | 81.1 | 94.7 | [81.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](/configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth) |
| 32x3x1 | short-side 320 | MViTv2-B\* | From scratch | 82.6 | 95.8 | [82.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [95.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 225G | 51.2M | [config](/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth) |
| 40x3x1 | short-side 320 | MViTv2-L\* | From scratch | 85.4 | 96.2 | [86.1](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [97.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth) |

### Something-Something V2

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :----------------------------: | :-----------------------------: | :---------------: | :---: | :----: | :-----------------: | :---------------: |
| uniform 16 | short-side 320 | MViTv2-S\* | K400 | 68.1 | 91.0 | [68.2](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [91.4](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 64G | 34.4M | [config](/configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth) |
| uniform 32 | short-side 320 | MViTv2-B\* | K400 | 70.8 | 92.7 | [70.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [92.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 225G | 51.1M | [config](/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth) |
| uniform 40 | short-side 320 | MViTv2-L\* | IN21K + K400 | 73.2 | 94.0 | [73.3](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth) |

*Models with \* are ported from the [SlowFast](https://github.com/facebookresearch/SlowFast/) repo and tested on our data. Currently, we only support testing of MViT models; training will be available soon.*

1. The values in the columns named "reference" are copied from the original paper.
2. The validation set of Kinetics400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available.

For more details on data preparation, you can refer to [Kinetics400](/tools/data/kinetics/README.md).
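
For illustration, the data list described above can be read with a few lines of Python. This is a hypothetical sketch: it assumes whitespace- or comma-separated fields per line, so adapt it to the delimiter used in your copy of the list file.

```python
# Hypothetical sketch: parse a Kinetics-400 list file whose lines follow the
# 'video_id, num_frames, label_index' format mentioned above.
# Assumes whitespace- or comma-separated fields; the path is a placeholder.
from pathlib import Path


def load_video_list(path):
    entries = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip empty lines
        video_id, num_frames, label = line.replace(',', ' ').split()
        entries.append((video_id, int(num_frames), int(label)))
    return entries


# entries = load_video_list('data/kinetics400/kinetics_val_list.txt')
# print(len(entries), entries[0])
```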

## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the MViT model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py \
checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
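
The dumped file can then be inspected offline. The snippet below is a minimal sketch that assumes `result.pkl` is a list of per-sample prediction dicts (as written by the `DumpResults` metric); the exact keys, such as `pred_score`, may differ between versions.

```python
# Minimal sketch: load and inspect the predictions dumped by tools/test.py.
# Assumes result.pkl holds a list of per-sample dicts; key names may vary.
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

print(f'{len(results)} samples')
sample = results[0]
print('available keys:', sorted(sample.keys()))

if 'pred_score' in sample:
    scores = sample['pred_score']
    print('predicted class index:', int(scores.argmax()))
```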

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).

## Citation

```bibtex
@inproceedings{li2021improved,
title={MViTv2: Improved multiscale vision transformers for classification and detection},
author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
booktitle={CVPR},
year={2022}
}
```
115 changes: 115 additions & 0 deletions configs/recognition/mvit/metafile.yml
@@ -0,0 +1,115 @@
Collections:
- Name: MViT
README: configs/recognition/mvit/README.md
Paper:
URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
Title: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

Models:
- Name: mvit-small-p244_16x4x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-small
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 81.1
Top 5 Accuracy: 94.7
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth

- Name: mvit-base-p244_32x3x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-base
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 82.6
Top 5 Accuracy: 95.8
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth

- Name: mvit-large-p244_40x3x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-large
Resolution: short-side 446
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 85.4
Top 5 Accuracy: 96.2
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth

- Name: mvit-small-p244_u16_sthv2-rgb
Config: configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-small
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 68.1
Top 5 Accuracy: 91.0
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth

- Name: mvit-base-p244_u32_sthv2-rgb
Config: configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-base
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 70.8
Top 5 Accuracy: 92.7
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth

- Name: mvit-large-p244_u40_sthv2-rgb
Config: configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-large
Resolution: short-side 446
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 73.2
Top 5 Accuracy: 94.0
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth
150 changes: 150 additions & 0 deletions configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
@@ -0,0 +1,150 @@
_base_ = [
'../../_base_/models/mvit_small.py', '../../_base_/default_runtime.py'
]

model = dict(
backbone=dict(
arch='base',
temporal_size=32,
drop_path_rate=0.3,
),
data_preprocessor=dict(
type='ActionDataPreprocessor',
mean=[114.75, 114.75, 114.75],
std=[57.375, 57.375, 57.375],
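# RandomBatchAugment randomly picks one of the augments below for each
# training batch, so every batch is blended with either MixUp or CutMix
# inside the data preprocessor.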
blending=dict(
type='RandomBatchAugment',
augments=[
dict(type='MixupBlending', alpha=0.8, num_classes=400),
dict(type='CutmixBlending', alpha=1, num_classes=400)
]),
format_shape='NCTHW'),
)

# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/kinetics400/videos_train'
data_root_val = 'data/kinetics400/videos_val'
ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'
ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

file_client_args = dict(io_backend='disk')
train_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(type='SampleFrames', clip_len=32, frame_interval=3, num_clips=1),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(
type='PytorchVideoWrapper',
op='RandAugment',
magnitude=7,
num_layers=4),
dict(type='RandomResizedCrop'),
dict(type='Resize', scale=(224, 224), keep_ratio=False),
dict(type='Flip', flip_ratio=0.5),
dict(type='RandomErasing', erase_prob=0.25, mode='rand'),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]
val_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=3,
num_clips=1,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=224),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]
test_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=3,
num_clips=5,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 224)),
dict(type='CenterCrop', crop_size=224),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]

train_dataloader = dict(
batch_size=8,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type=dataset_type,
ann_file=ann_file_train,
data_prefix=dict(video=data_root),
pipeline=train_pipeline))
val_dataloader = dict(
batch_size=8,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
ann_file=ann_file_val,
data_prefix=dict(video=data_root_val),
pipeline=val_pipeline,
test_mode=True))
test_dataloader = dict(
batch_size=1,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
ann_file=ann_file_test,
data_prefix=dict(video=data_root_val),
pipeline=test_pipeline,
test_mode=True))

val_evaluator = dict(type='AccMetric')
test_evaluator = val_evaluator

train_cfg = dict(
type='EpochBasedTrainLoop', max_epochs=30, val_begin=1, val_interval=3)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(
type='AmpOptimWrapper',
optimizer=dict(
type='AdamW', lr=1.6e-3, betas=(0.9, 0.999), weight_decay=0.05))

param_scheduler = [
dict(
type='LinearLR',
start_factor=0.1,
by_epoch=True,
begin=0,
end=30,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=200,
eta_min=0,
by_epoch=True,
begin=0,
end=200,
convert_to_iter_based=True)
]

default_hooks = dict(
checkpoint=dict(interval=3, max_keep_ckpts=5), logger=dict(interval=100))

# Default setting for scaling LR automatically
# - `enable` means whether to enable scaling LR automatically by default.
# - `base_batch_size` = (8 GPUs) x (8 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=64)
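
Since this PR only supports testing, the config above is typically consumed through `tools/test.py` or the high-level inference API. The following is a minimal sketch (not part of this PR) assuming the `mmaction.apis` helpers from 1.x; the checkpoint and video paths are placeholders.

```python
# Minimal sketch: single-video inference with the config above.
# Assumes mmaction2 1.x provides init_recognizer/inference_recognizer;
# checkpoint and video paths below are placeholders.
from mmaction.apis import inference_recognizer, init_recognizer

config = 'configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py'
checkpoint = 'checkpoints/SOME_CHECKPOINT.pth'  # download from the table above
video = 'demo/demo.mp4'  # any short RGB video

model = init_recognizer(config, checkpoint, device='cuda:0')
result = inference_recognizer(model, video)
print(result)
```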