Showing 23 changed files with 2,790 additions and 18 deletions.
configs/_base_/models/mvit_small.py
model = dict(
    type='Recognizer3D',
    backbone=dict(type='MViT', arch='small', drop_path_rate=0.2),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[123.675, 116.28, 103.53],
        std=[58.395, 57.12, 57.375],
        format_shape='NCTHW'),
    cls_head=dict(
        type='MViTHead',
        in_channels=768,
        num_classes=400,
        label_smooth_eps=0.1,
        average_clips='prob'))
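
This base config can be composed into a full recognizer with MMEngine's config and registry machinery. The snippet below is a minimal sketch, assuming mmaction2 1.x and mmengine are installed; the helper `register_all_modules` and the exact registry layout may differ between versions.

```python
# Minimal sketch: build the Recognizer3D defined by the base config above.
# Assumes mmaction2 1.x / mmengine; `register_all_modules` may vary by version.
from mmengine.config import Config
from mmaction.registry import MODELS
from mmaction.utils import register_all_modules

register_all_modules()  # register mmaction2 components into the registries
cfg = Config.fromfile('configs/_base_/models/mvit_small.py')
model = MODELS.build(cfg.model)
print(type(model).__name__)  # expected: Recognizer3D
```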
configs/recognition/mvit/README.md
# MViT V2

> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/33249023/196627033-03a4e9b1-082e-42ee-a2a0-77f874fe632a.png" width="50%"/>
</div>

## Results and models

### Kinetics-400

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :-----------------------------: | :-----------------------------: | :--------------: | :---: | :----: | :-----------------: | :---------------: |
| 16x4x1 | short-side 320 | MViTv2-S\* | From scratch | 81.1 | 94.7 | [81.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](/configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth) |
| 32x3x1 | short-side 320 | MViTv2-B\* | From scratch | 82.6 | 95.8 | [82.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [95.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 225G | 51.2M | [config](/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth) |
| 40x3x1 | short-side 320 | MViTv2-L\* | From scratch | 85.4 | 96.2 | [86.1](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [97.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth) |

### Something-Something V2

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :----------------------------: | :-----------------------------: | :---------------: | :---: | :----: | :-----------------: | :---------------: |
| uniform 16 | short-side 320 | MViTv2-S\* | K400 | 68.1 | 91.0 | [68.2](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [91.4](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 64G | 34.4M | [config](/configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth) |
| uniform 32 | short-side 320 | MViTv2-B\* | K400 | 70.8 | 92.7 | [70.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [92.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 225G | 51.1M | [config](/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth) |
| uniform 40 | short-side 320 | MViTv2-L\* | IN21K + K400 | 73.2 | 94.0 | [73.3](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth) |

*Models with \* are ported from the repo [SlowFast](https://github.com/facebookresearch/SlowFast/) and tested on our data. Currently, we only support testing of MViT models; training will be available soon.*

1. The values in the columns named "reference" are copied from the original paper.
2. The validation set of Kinetics-400 we used consists of 19796 videos. These videos are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available; see the parsing sketch after this list.
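
As a quick check of the data list described above, the following sketch parses it into tuples. This is an illustrative helper, not part of the repository; the local file name and the whitespace separator are assumptions.

```python
# Hypothetical helper: parse the Kinetics-400 validation data list, assuming
# each line is whitespace-separated as 'video_id num_frames label_index'.
from collections import Counter


def load_data_list(path='kinetics_val_list.txt'):
    entries = []
    with open(path) as f:
        for line in f:
            video_id, num_frames, label_index = line.split()
            entries.append((video_id, int(num_frames), int(label_index)))
    return entries


# Example: count how many distinct class indices appear in the list.
counts = Counter(label for _, _, label in load_data_list())
print(len(counts), 'classes')
```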

For more details on data preparation, you can refer to [Kinetics400](/tools/data/kinetics/README.md).

## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the MViT model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
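
Once testing finishes, the dumped file can be inspected directly with Python. This is a minimal sketch; the exact structure of each entry (keys, tensor types) depends on the mmaction2 version, so treat the printed fields as assumptions.

```python
# Inspect the dumped test results; entry contents vary with mmaction2 version.
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)

print(len(results))  # number of test samples
print(results[0])    # typically per-sample prediction scores and labels
```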

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).

## Citation

```bibtex
@inproceedings{li2021improved,
  title={MViTv2: Improved multiscale vision transformers for classification and detection},
  author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
  booktitle={CVPR},
  year={2022}
}
```
configs/recognition/mvit/metafile.yml
Collections:
  - Name: MViT
    README: configs/recognition/mvit/README.md
    Paper:
      URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
      Title: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

Models:
  - Name: mvit-small-p244_16x4x1_kinetics400-rgb
    Config: configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py
    In Collection: MViT
    Metadata:
      Architecture: MViT-small
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
        Code: https://github.com/facebookresearch/SlowFast/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 81.1
          Top 5 Accuracy: 94.7
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth

  - Name: mvit-base-p244_32x3x1_kinetics400-rgb
    Config: configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
    In Collection: MViT
    Metadata:
      Architecture: MViT-base
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
        Code: https://github.com/facebookresearch/SlowFast/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 82.6
          Top 5 Accuracy: 95.8
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth

  - Name: mvit-large-p244_40x3x1_kinetics400-rgb
    Config: configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py
    In Collection: MViT
    Metadata:
      Architecture: MViT-large
      Resolution: short-side 446
      Modality: RGB
      Converted From:
        Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
        Code: https://github.com/facebookresearch/SlowFast/
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 85.4
          Top 5 Accuracy: 96.2
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth

  - Name: mvit-small-p244_u16_sthv2-rgb
    Config: configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py
    In Collection: MViT
    Metadata:
      Architecture: MViT-small
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
        Code: https://github.com/facebookresearch/SlowFast/
    Results:
      - Dataset: SthV2
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 68.1
          Top 5 Accuracy: 91.0
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth

  - Name: mvit-base-p244_u32_sthv2-rgb
    Config: configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py
    In Collection: MViT
    Metadata:
      Architecture: MViT-base
      Resolution: short-side 320
      Modality: RGB
      Converted From:
        Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
        Code: https://github.com/facebookresearch/SlowFast/
    Results:
      - Dataset: SthV2
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 70.8
          Top 5 Accuracy: 92.7
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth

  - Name: mvit-large-p244_u40_sthv2-rgb
    Config: configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py
    In Collection: MViT
    Metadata:
      Architecture: MViT-large
      Resolution: short-side 446
      Modality: RGB
      Converted From:
        Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
        Code: https://github.com/facebookresearch/SlowFast/
    Results:
      - Dataset: SthV2
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 73.2
          Top 5 Accuracy: 94.0
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth
configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py (150 additions, 0 deletions)
_base_ = [
    '../../_base_/models/mvit_small.py', '../../_base_/default_runtime.py'
]

model = dict(
    backbone=dict(
        arch='base',
        temporal_size=32,
        drop_path_rate=0.3,
    ),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[114.75, 114.75, 114.75],
        std=[57.375, 57.375, 57.375],
        blending=dict(
            type='RandomBatchAugment',
            augments=[
                dict(type='MixupBlending', alpha=0.8, num_classes=400),
                dict(type='CutmixBlending', alpha=1, num_classes=400)
            ]),
        format_shape='NCTHW'),
)

# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/kinetics400/videos_train'
data_root_val = 'data/kinetics400/videos_val'
ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'
ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

file_client_args = dict(io_backend='disk')
train_pipeline = [
    dict(type='DecordInit', **file_client_args),
    dict(type='SampleFrames', clip_len=32, frame_interval=3, num_clips=1),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(
        type='PytorchVideoWrapper',
        op='RandAugment',
        magnitude=7,
        num_layers=4),
    dict(type='RandomResizedCrop'),
    dict(type='Resize', scale=(224, 224), keep_ratio=False),
    dict(type='Flip', flip_ratio=0.5),
    dict(type='RandomErasing', erase_prob=0.25, mode='rand'),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
val_pipeline = [
    dict(type='DecordInit', **file_client_args),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=3,
        num_clips=1,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 256)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]
test_pipeline = [
    dict(type='DecordInit', **file_client_args),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=3,
        num_clips=5,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]

train_dataloader = dict(
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_train,
        data_prefix=dict(video=data_root),
        pipeline=train_pipeline))
val_dataloader = dict(
    batch_size=8,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_val,
        data_prefix=dict(video=data_root_val),
        pipeline=val_pipeline,
        test_mode=True))
test_dataloader = dict(
    batch_size=1,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True))

val_evaluator = dict(type='AccMetric')
test_evaluator = val_evaluator

train_cfg = dict(
    type='EpochBasedTrainLoop', max_epochs=30, val_begin=1, val_interval=3)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(
    type='AmpOptimWrapper',
    optimizer=dict(
        type='AdamW', lr=1.6e-3, betas=(0.9, 0.999), weight_decay=0.05))

param_scheduler = [
    dict(
        type='LinearLR',
        start_factor=0.1,
        by_epoch=True,
        begin=0,
        end=30,
        convert_to_iter_based=True),
    dict(
        type='CosineAnnealingLR',
        T_max=200,
        eta_min=0,
        by_epoch=True,
        begin=0,
        end=200,
        convert_to_iter_based=True)
]

default_hooks = dict(
    checkpoint=dict(interval=3, max_keep_ckpts=5), logger=dict(interval=100))

# Default setting for scaling LR automatically
# - `enable` means enable scaling LR automatically
# or not by default.
# - `base_batch_size` = (8 GPUs) x (8 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=64)
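
To sanity-check this config, it can be loaded standalone with MMEngine and the merged settings inspected. A minimal sketch, assuming mmengine is installed and the repository root is the working directory:

```python
# Load the full (merged) config and print a few of the settings defined above.
from mmengine.config import Config

cfg = Config.fromfile(
    'configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py')
print(cfg.model.backbone.arch)               # 'base'
print(cfg.optim_wrapper.optimizer['lr'])     # 0.0016
print(cfg.auto_scale_lr['base_batch_size'])  # 64
```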