[Feature] support MViT (#2007)
cir7 authored Dec 1, 2022
1 parent 166f447 commit add1242
Showing 23 changed files with 2,790 additions and 18 deletions.
14 changes: 14 additions & 0 deletions configs/_base_/models/mvit_small.py
@@ -0,0 +1,14 @@
model = dict(
type='Recognizer3D',
backbone=dict(type='MViT', arch='small', drop_path_rate=0.2),
data_preprocessor=dict(
type='ActionDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
format_shape='NCTHW'),
cls_head=dict(
type='MViTHead',
in_channels=768,
num_classes=400,
label_smooth_eps=0.1,
average_clips='prob'))
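
The base model config above only defines the network. As a quick sanity check, it can be built through the MMEngine registry; the snippet below is a minimal sketch assuming an mmaction2 1.x environment (the `register_all_modules` helper and the `MODELS` registry come from that release line), and the 16-frame, 224x224 dummy clip simply matches the backbone's default temporal and spatial sizes.

```python
# Minimal sketch: build the Recognizer3D defined in mvit_small.py and run a
# dummy clip through the backbone. Assumes an mmaction2 1.x checkout.
import torch
from mmengine.config import Config
from mmaction.registry import MODELS
from mmaction.utils import register_all_modules

register_all_modules()  # populate the MMEngine registries with mmaction modules

cfg = Config.fromfile('configs/_base_/models/mvit_small.py')
model = MODELS.build(cfg.model)
model.eval()

# NCTHW dummy clip: batch 1, 3 channels, 16 frames, 224x224 (backbone defaults)
clip = torch.randn(1, 3, 16, 224, 224)
with torch.no_grad():
    feats = model.backbone(clip)  # forward through the MViT backbone only
print(type(feats))
```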
77 changes: 77 additions & 0 deletions configs/recognition/mvit/README.md
@@ -0,0 +1,77 @@
# MViT V2

> [MViTv2: Improved Multiscale Vision Transformers for Classification and Detection](http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf)
<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video
classification, as well as object detection. We present an improved version of MViT that incorporates
decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture
in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where
it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where
it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art
performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as
well as 86.1% on Kinetics-400 video classification.

<!-- [IMAGE] -->

<div align=center>
<img src="https://user-images.githubusercontent.com/33249023/196627033-03a4e9b1-082e-42ee-a2a0-77f874fe632a.png" width="50%"/>
</div>

## Results and models

### Kinetics-400

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :-----------------------------: | :-----------------------------: | :--------------: | :---: | :----: | :-----------------: | :---------------: |
| 16x4x1 | short-side 320 | MViTv2-S\* | From scratch | 81.1 | 94.7 | [81.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.6](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 64G | 34.5M | [config](/configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth) |
| 32x3x1 | short-side 320 | MViTv2-B\* | From scratch | 82.6 | 95.8 | [82.9](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [95.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 1 crop | 225G | 51.2M | [config](/configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth) |
| 40x3x1 | short-side 320 | MViTv2-L\* | From scratch | 85.4 | 96.2 | [86.1](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [97.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 5 clips x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth) |

### Something-Something V2

| frame sampling strategy | resolution | backbone | pretrain | top1 acc | top5 acc | reference top1 acc | reference top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :--------: | :----------: | :------: | :------: | :----------------------------: | :-----------------------------: | :---------------: | :---: | :----: | :-----------------: | :---------------: |
| uniform 16 | short-side 320 | MViTv2-S\* | K400 | 68.1 | 91.0 | [68.2](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [91.4](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 64G | 34.4M | [config](/configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth) |
| uniform 32 | short-side 320 | MViTv2-B\* | K400 | 70.8 | 92.7 | [70.5](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [92.7](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 225G | 51.1M | [config](/configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth) |
| uniform 40 | short-side 320 | MViTv2-L\* | IN21K + K400 | 73.2 | 94.0 | [73.3](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | [94.0](https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md) | 1 clip x 3 crops | 2828G | 213M | [config](/configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth) |

*Models with \* are ported from the [SlowFast](https://github.com/facebookresearch/SlowFast/) repo and tested on our data. Currently we only support testing MViT models; training support will be available soon.*

1. The values in the columns named "reference" are copied from the original paper.
2. The validation set of Kinetics-400 we used consists of 19,796 videos, which are available at [Kinetics400-Validation](https://mycuhk-my.sharepoint.com/:u:/g/personal/1155136485_link_cuhk_edu_hk/EbXw2WX94J1Hunyt3MWNDJUBz-nHvQYhO9pvKqm6g39PMA?e=a9QldB). The corresponding [data list](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_val_list.txt) (each line is of the format 'video_id, num_frames, label_index') and the [label map](https://download.openmmlab.com/mmaction/dataset/k400_val/kinetics_class2ind.txt) are also available; a minimal parsing sketch is shown below.
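
The data list referenced in note 2 is a plain text file with one sample per line. The snippet below is a small, self-contained sketch for loading it; the whitespace split is an assumption (mmaction2 file lists are usually space-separated), so adjust it if your copy uses commas.

```python
# Hedged sketch: reading the Kinetics-400 validation list described above.
# Assumes each line holds 'video_id num_frames label_index' separated by
# whitespace; adapt the split if your copy uses a different delimiter.
from collections import Counter

def load_video_list(path):
    samples = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip malformed or empty lines
            video_id, num_frames, label = parts
            samples.append((video_id, int(num_frames), int(label)))
    return samples

samples = load_video_list('kinetics_val_list.txt')
print(len(samples), 'validation videos')
print(Counter(label for _, _, label in samples).most_common(3))
```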

For more details on data preparation, you can refer to [Kinetics400](/tools/data/kinetics/README.md).

## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the MViT model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py \
checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```

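The dumped `result.pkl` can be inspected offline. The sketch below assumes the file loads with `mmengine.load` (the dump is a pickle file); the per-sample structure depends on the mmaction2 version, so the keys printed here are not guaranteed.

```python
# Hedged sketch: inspecting the file produced by `--dump result.pkl` above.
# mmengine.load handles .pkl files; the structure of each entry varies across
# mmaction2 versions, so treat the printed keys as informational only.
import mmengine

results = mmengine.load('result.pkl')  # typically a list, one item per test sample
print(len(results), 'test samples')

first = results[0]
if isinstance(first, dict):
    print(first.keys())  # e.g. predicted scores / labels / ground truth
```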
For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).

## Citation

```bibtex
@inproceedings{li2021improved,
title={MViTv2: Improved multiscale vision transformers for classification and detection},
author={Li, Yanghao and Wu, Chao-Yuan and Fan, Haoqi and Mangalam, Karttikeya and Xiong, Bo and Malik, Jitendra and Feichtenhofer, Christoph},
booktitle={CVPR},
year={2022}
}
```
115 changes: 115 additions & 0 deletions configs/recognition/mvit/metafile.yml
@@ -0,0 +1,115 @@
Collections:
- Name: MViT
README: configs/recognition/mvit/README.md
Paper:
URL: http://openaccess.thecvf.com//content/CVPR2022/papers/Li_MViTv2_Improved_Multiscale_Vision_Transformers_for_Classification_and_Detection_CVPR_2022_paper.pdf
Title: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection"

Models:
- Name: mvit-small-p244_16x4x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-small-p244_16x4x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-small
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 81.1
Top 5 Accuracy: 94.7
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_16x4x1_kinetics400-rgb_20221021-9ebaaeed.pth

- Name: mvit-base-p244_32x3x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-base
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 82.6
Top 5 Accuracy: 95.8
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_32x3x1_kinetics400-rgb_20221021-f392cd2d.pth

- Name: mvit-large-p244_40x3x1_kinetics400-rgb
Config: configs/recognition/mvit/mvit-large-p244_40x3x1_kinetics400-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-large
Resolution: short-side 446
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 85.4
Top 5 Accuracy: 96.2
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_40x3x1_kinetics400-rgb_20221021-11fe1f97.pth

- Name: mvit-small-p244_u16_sthv2-rgb
Config: configs/recognition/mvit/mvit-small-p244_u16_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-small
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 68.1
Top 5 Accuracy: 91.0
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-small-p244_u16_sthv2-rgb_20221021-65ecae7d.pth

- Name: mvit-base-p244_u32_sthv2-rgb
Config: configs/recognition/mvit/mvit-base-p244_u32_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-base
Resolution: short-side 320
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 70.8
Top 5 Accuracy: 92.7
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-base-p244_u32_sthv2-rgb_20221021-d5de5da6.pth

- Name: mvit-large-p244_u40_sthv2-rgb
Config: configs/recognition/mvit/mvit-large-p244_u40_sthv2-rgb.py
In Collection: MViT
Metadata:
Architecture: MViT-large
Resolution: short-side 446
Modality: RGB
Converted From:
Weights: https://github.com/facebookresearch/SlowFast/blob/main/projects/mvitv2/README.md
Code: https://github.com/facebookresearch/SlowFast/
Results:
- Dataset: SthV2
Task: Action Recognition
Metrics:
Top 1 Accuracy: 73.2
Top 5 Accuracy: 94.0
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/mvit/converted/mvit-large-p244_u40_sthv2-rgb_20221021-61696e07.pth
150 changes: 150 additions & 0 deletions configs/recognition/mvit/mvit-base-p244_32x3x1_kinetics400-rgb.py
@@ -0,0 +1,150 @@
_base_ = [
'../../_base_/models/mvit_small.py', '../../_base_/default_runtime.py'
]

model = dict(
backbone=dict(
arch='base',
temporal_size=32,
drop_path_rate=0.3,
),
data_preprocessor=dict(
type='ActionDataPreprocessor',
mean=[114.75, 114.75, 114.75],
std=[57.375, 57.375, 57.375],
blending=dict(
type='RandomBatchAugment',
augments=[
dict(type='MixupBlending', alpha=0.8, num_classes=400),
dict(type='CutmixBlending', alpha=1, num_classes=400)
]),
format_shape='NCTHW'),
)

# dataset settings
dataset_type = 'VideoDataset'
data_root = 'data/kinetics400/videos_train'
data_root_val = 'data/kinetics400/videos_val'
ann_file_train = 'data/kinetics400/kinetics400_train_list_videos.txt'
ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

file_client_args = dict(io_backend='disk')
train_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(type='SampleFrames', clip_len=32, frame_interval=3, num_clips=1),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(
type='PytorchVideoWrapper',
op='RandAugment',
magnitude=7,
num_layers=4),
dict(type='RandomResizedCrop'),
dict(type='Resize', scale=(224, 224), keep_ratio=False),
dict(type='Flip', flip_ratio=0.5),
dict(type='RandomErasing', erase_prob=0.25, mode='rand'),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]
val_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=3,
num_clips=1,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 256)),
dict(type='CenterCrop', crop_size=224),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]
test_pipeline = [
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=32,
frame_interval=3,
num_clips=5,
test_mode=True),
dict(type='DecordDecode'),
dict(type='Resize', scale=(-1, 224)),
dict(type='CenterCrop', crop_size=224),
dict(type='FormatShape', input_format='NCTHW'),
dict(type='PackActionInputs')
]

train_dataloader = dict(
batch_size=8,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type=dataset_type,
ann_file=ann_file_train,
data_prefix=dict(video=data_root),
pipeline=train_pipeline))
val_dataloader = dict(
batch_size=8,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
ann_file=ann_file_val,
data_prefix=dict(video=data_root_val),
pipeline=val_pipeline,
test_mode=True))
test_dataloader = dict(
batch_size=1,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type=dataset_type,
ann_file=ann_file_test,
data_prefix=dict(video=data_root_val),
pipeline=test_pipeline,
test_mode=True))

val_evaluator = dict(type='AccMetric')
test_evaluator = val_evaluator

train_cfg = dict(
type='EpochBasedTrainLoop', max_epochs=30, val_begin=1, val_interval=3)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')

optim_wrapper = dict(
type='AmpOptimWrapper',
optimizer=dict(
type='AdamW', lr=1.6e-3, betas=(0.9, 0.999), weight_decay=0.05))

param_scheduler = [
dict(
type='LinearLR',
start_factor=0.1,
by_epoch=True,
begin=0,
end=30,
convert_to_iter_based=True),
dict(
type='CosineAnnealingLR',
T_max=200,
eta_min=0,
by_epoch=True,
begin=0,
end=200,
convert_to_iter_based=True)
]

default_hooks = dict(
checkpoint=dict(interval=3, max_keep_ckpts=5), logger=dict(interval=100))

# Default setting for scaling LR automatically
# - `enable` means enable scaling LR automatically
# or not by default.
# - `base_batch_size` = (8 GPUs) x (8 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=64)
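
For intuition on the `32x3x1` / 5-clip testing protocol configured in `test_pipeline` above, the standalone sketch below approximates how evenly spaced test clips of 32 frames with a frame interval of 3 could be drawn from a video. It is an illustration only, not mmaction2's exact `SampleFrames` implementation.

```python
# Hedged, self-contained sketch of multi-clip test sampling in the spirit of
# SampleFrames(clip_len=32, frame_interval=3, num_clips=5, test_mode=True).
# An approximation for intuition, not the library's exact algorithm.
import numpy as np

def sample_test_clips(total_frames, clip_len=32, frame_interval=3, num_clips=5):
    span = clip_len * frame_interval               # frames covered by one clip
    max_start = max(total_frames - span, 0)        # last valid clip start
    starts = np.linspace(0, max_start, num_clips).astype(int)  # evenly spaced clips
    clips = []
    for s in starts:
        idx = s + np.arange(clip_len) * frame_interval
        clips.append(np.clip(idx, 0, total_frames - 1))  # guard short videos
    return np.stack(clips)                          # (num_clips, clip_len)

idx = sample_test_clips(total_frames=300)
print(idx.shape)            # (5, 32)
print(idx[0][:5], idx[-1][-5:])
```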