[Feature] support MViT (#2007)
cir7 authored Dec 1, 2022
1 parent 166f447 commit add1242
Showing 23 changed files with 2,790 additions and 18 deletions.
14 changes: 14 additions & 0 deletions configs/_base_/models/mvit_small.py
@@ -0,0 +1,14 @@
model = dict(
type='Recognizer3D',
backbone=dict(type='MViT', arch='small', drop_path_rate=0.2),
data_preprocessor=dict(
type='ActionDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
format_shape='NCTHW'),
cls_head=dict(
type='MViTHead',
in_channels=768,
num_classes=400,
label_smooth_eps=0.1,
average_clips='prob'))
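
The base model config above only defines the network. As a quick sanity check, it can be built through the MMEngine registry; the snippet below is a minimal sketch assuming an mmaction2 1.x environment (the `register_all_modules` helper and the `MODELS` registry come from that release line), and the 16-frame, 224x224 dummy clip simply matches the backbone's default temporal and spatial sizes.

```python
# Minimal sketch: build the Recognizer3D defined in mvit_small.py and run a
# dummy clip through the backbone. Assumes an mmaction2 1.x checkout.
import torch
from mmengine.config import Config
from mmaction.registry import MODELS
from mmaction.utils import register_all_modules

register_all_modules()  # populate the MMEngine registries with mmaction modules

cfg = Config.fromfile('configs/_base_/models/mvit_small.py')
model = MODELS.build(cfg.model)
model.eval()

# NCTHW dummy clip: batch 1, 3 channels, 16 frames, 224x224 (backbone defaults)
clip = torch.randn(1, 3, 16, 224, 224)
with torch.no_grad():
    feats = model.backbone(clip)  # forward through the MViT backbone only
print(type(feats))
```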
77 changes: 77 additions & 0 deletions configs/recognition/mvit/README.md
@@ -0,0 +1,77 @@
# MViT V2

> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf)
<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video
classification, as well as object detection. We present an improved version of MViT that incorporates
decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture
in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where
it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where
it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art
performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as
well as 86.1% on Kinetics-400 video classification.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/33249023/196627033-03a4e9b1-082e-42ee-a2a0-77f874fe632a.png" width="50%"/>
</div>

## Results and models

### Kinetics-400

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :-----------------------------: | :-----------------------------: | :--------------: | :---: | :----: | :-----------------: | :---------------: |
| 16x4x1 | short-side 320 | MViTv2-S\* | From scratch | 81.1 | 94.7 | [81.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](/configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth) |
| 32x3x1 | short-side 320 | MViTv2-B\* | From scratch | 82.6 | 95.8 | [82.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [95.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 225G | 51.2M | [config](/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth) |
| 40x3x1 | short-side 320 | MViTv2-L\* | From scratch | 85.4 | 96.2 | [86.1](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [97.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth) |

### Something-Something V2

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :----------------------------: | :-----------------------------: | :---------------: | :---: | :----: | :-----------------: | :---------------: |
| uniform 16 | short-side 320 | MViTv2-S\* | K400 | 68.1 | 91.0 | [68.2](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [91.4](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 64G | 34.4M | [config](/configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth) |
| uniform 32 | short-side 320 | MViTv2-B\* | K400 | 70.8 | 92.7 | [70.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [92.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 225G | 51.1M | [config](/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth) |
| uniform 40 | short-side 320 | MViTv2-L\* | IN21K + K400 | 73.2 | 94.0 | [73.3](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth) |

*Models with \* are ported from the [SlowFast](https://github.com/facebookresearch/SlowFast/) repo and tested on our data. Currently we only support testing MViT models; training support will be available soon.*

1. The values in the columns named "reference" are copied from the original paper.
2. The validation set of Kinetics-400 we used consists of 19,796 videos, which are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available; a minimal parsing sketch is shown below.
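
The data list referenced in note 2 is a plain text file with one sample per line. The snippet below is a small, self-contained sketch for loading it; the whitespace split is an assumption (mmaction2 file lists are usually space-separated), so adjust it if your copy uses commas.

```python
# Hedged sketch: reading the Kinetics-400 validation list described above.
# Assumes each line holds 'video_id num_frames label_index' separated by
# whitespace; adapt the split if your copy uses a different delimiter.
from collections import Counter

def load_video_list(path):
    samples = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip malformed or empty lines
            video_id, num_frames, label = parts
            samples.append((video_id, int(num_frames), int(label)))
    return samples

samples = load_video_list('kinetics_val_list.txt')
print(len(samples), 'validation videos')
print(Counter(label for _, _, label in samples).most_common(3))
```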

For more details on data preparation, you can refer to [Kinetics400](/tools/data/kinetics/README.md).

## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the MViT model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py \
checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

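The dumped `result.pkl` can be inspected offline. The sketch below assumes the file loads with `mmengine.load` (the dump is a pickle file); the per-sample structure depends on the mmaction2 version, so the keys printed here are not guaranteed.

```python
# Hedged sketch: inspecting the file produced by `--dump result.pkl` above.
# mmengine.load handles .pkl files; the structure of each entry varies across
# mmaction2 versions, so treat the printed keys as informational only.
import mmengine

results = mmengine.load('result.pkl')  # typically a list, one item per test sample
print(len(results), 'test samples')

first = results[0]
if isinstance(first, dict):
    print(first.keys())  # e.g. predicted scores / labels / ground truth
```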
For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).

## Citation

```bibtex
@inproceedings{li2021improved,
title={MViTv2: Improved multiscale vision transformers for classification and detection},
author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
booktitle={CVPR},
year={2022}
}
```
115 changes: 115 additions & 0 deletions configs/recognition/mvit/metafile.yml
@@ -0,0 +1,115 @@
Collections:
- Name: MViT
README: configs/recognition/mvit/README.md
Paper:
URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
Title: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

Models:
- Name: mvit-small-p244_16x4x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-small
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 81.1
Top 5 Accuracy: 94.7
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth

- Name: mvit-base-p244_32x3x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-base
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 82.6
Top 5 Accuracy: 95.8
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth

- Name: mvit-large-p244_40x3x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-large
Resolution: short-side 446
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 85.4
Top 5 Accuracy: 96.2
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth

- Name: mvit-small-p244_u16_sthv2-rgb
Config: configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-small
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 68.1
Top 5 Accuracy: 91.0
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth

- Name: mvit-base-p244_u32_sthv2-rgb
Config: configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-base
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 70.8
Top 5 Accuracy: 92.7
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth

- Name: mvit-large-p244_u40_sthv2-rgb
Config: configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-large
Resolution: short-side 446
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 73.2
Top 5 Accuracy: 94.0
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth
150 changes: 150 additions & 0 deletions configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
@@ -0,0 +1,150 @@
_base_ = [
'../../_base_/models/mvit_small.py', '../../_base_/default_runtime.py'
]

model = dict(
backbone=dict(
arch='base',
temporal_size=32,
drop_path_rate=0.3,
),
data_preprocessor=dict(
type='ActionDataPreprocessor',
mean=[114.75, 114.75, 114.75],
std=[57.375, 57.375, 57.375],
blending=dict(
type='RandomBatchAugment',
augments=[
dict(type='MixupBlending', alpha=0.8, num_classes=400),
dict(type='CutmixBlending', alpha=1, num_classes=400)
]),
format_shape='NCTHW'),
)

# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/kinetics400/videos_train'
data_root_val = 'data/kinetics400/videos_val'
ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'
ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

file_client_args = dict(io_backend='disk')
train_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(type='SampleFrames', clip_len=32, frame_interval=3, num_clips=1),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(
type='PytorchVideoWrapper',
op='RandAugment',
magnitude=7,
num_layers=4),
dict(type='RandomResizedCrop'),
dict(type='Resize', scale=(224, 224), keep_ratio=False),
dict(type='Flip', flip_ratio=0.5),
dict(type='RandomErasing', erase_prob=0.25, mode='rand'),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]
val_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=3,
num_clips=1,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=224),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]
test_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=3,
num_clips=5,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 224)),
dict(type='CenterCrop', crop_size=224),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]

train_dataloader = dict(
batch_size=8,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type=dataset_type,
ann_file=ann_file_train,
data_prefix=dict(video=data_root),
pipeline=train_pipeline))
val_dataloader = dict(
batch_size=8,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
ann_file=ann_file_val,
data_prefix=dict(video=data_root_val),
pipeline=val_pipeline,
test_mode=True))
test_dataloader = dict(
batch_size=1,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
ann_file=ann_file_test,
data_prefix=dict(video=data_root_val),
pipeline=test_pipeline,
test_mode=True))

val_evaluator = dict(type='AccMetric')
test_evaluator = val_evaluator

train_cfg = dict(
type='EpochBasedTrainLoop', max_epochs=30, val_begin=1, val_interval=3)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(
type='AmpOptimWrapper',
optimizer=dict(
type='AdamW', lr=1.6e-3, betas=(0.9, 0.999), weight_decay=0.05))

param_scheduler = [
dict(
type='LinearLR',
start_factor=0.1,
by_epoch=True,
begin=0,
end=30,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=200,
eta_min=0,
by_epoch=True,
begin=0,
end=200,
convert_to_iter_based=True)
]

default_hooks = dict(
checkpoint=dict(interval=3, max_keep_ckpts=5), logger=dict(interval=100))

# Default setting for scaling LR automatically
# - `enable` means enable scaling LR automatically
# or not by default.
# - `base_batch_size` = (8 GPUs) x (8 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=64)
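
For intuition on the `32x3x1` / 5-clip testing protocol configured in `test_pipeline` above, the standalone sketch below approximates how evenly spaced test clips of 32 frames with a frame interval of 3 could be drawn from a video. It is an illustration only, not mmaction2's exact `SampleFrames` implementation.

```python
# Hedged, self-contained sketch of multi-clip test sampling in the spirit of
# SampleFrames(clip_len=32, frame_interval=3, num_clips=5, test_mode=True).
# An approximation for intuition, not the library's exact algorithm.
import numpy as np

def sample_test_clips(total_frames, clip_len=32, frame_interval=3, num_clips=5):
    span = clip_len * frame_interval               # frames covered by one clip
    max_start = max(total_frames - span, 0)        # last valid clip start
    starts = np.linspace(0, max_start, num_clips).astype(int)  # evenly spaced clips
    clips = []
    for s in starts:
        idx = s + np.arange(clip_len) * frame_interval
        clips.append(np.clip(idx, 0, total_frames - 1))  # guard short videos
    return np.stack(clips)                          # (num_clips, clip_len)

idx = sample_test_clips(total_frames=300)
print(idx.shape)            # (5, 32)
print(idx[0][:5], idx[-1][-5:])
```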