Commit (forked from open-mmlab/mmaction2)
- debug doc of uniformerv1&v2
- debug pre-commit of uniformerv1&v2
- debug doc
- [Refactor] Refactor and Enhance 2s-AGCN (open-mmlab#2130)
- [Doc] Update TSN models' README & metafile (open-mmlab#2122)
- [Doc] Update TimeSformer models' README & metafile (open-mmlab#2124)
- fix mvit readme (open-mmlab#2125)
- [Doc] Update SlowOnly models' README & metafile (open-mmlab#2126)
  Co-authored-by: wxDai <wxDai2001@gmail.com>
- [Doc] Update TRN models' README & metafile (open-mmlab#2129)
- debug merge
- debug doc and test
- debug doc
- fix init_weight
- update readme
- update readme
- fix the readme
Showing 37 changed files with 3,584 additions and 13 deletions.

66 changes: 66 additions & 0 deletions
configs/recognition/uniformer/README.md
# UniFormer

[UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning](https://arxiv.org/abs/2201.04676)

<!-- [ALGORITHM] -->

## Abstract

<!-- [ABSTRACT] -->

It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency, by learning local and global token affinity respectively in shallow and deep layers. We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods. For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1 accuracy respectively.

<!-- [IMAGE] -->

<div align=center>
<img src="https://raw.githubusercontent.com/Sense-X/UniFormer/main/figures/framework.png"/>
</div>
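
The split the abstract describes (local token affinity in shallow layers, global affinity in deep layers) can be made concrete with a short sketch. The PyTorch module below is illustrative only: it assumes a depthwise 3D convolution for the local relation aggregator and full spatiotemporal self-attention for the global one, and the class name, kernel sizes, and normalization choices are assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn


class UniFormerBlockSketch(nn.Module):
    """Toy illustration of a UniFormer block: shallow blocks learn *local*
    token affinity with a depthwise 3D conv; deep blocks learn *global*
    affinity with spatiotemporal self-attention."""

    def __init__(self, dim: int, num_heads: int = 8, local: bool = True):
        super().__init__()
        self.local = local
        # Dynamic position embedding: depthwise 3D conv over (T, H, W).
        self.dpe = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        if local:
            # Local relation aggregator: a fixed small 3D neighborhood.
            self.norm1 = nn.BatchNorm3d(dim)
            self.mhra = nn.Conv3d(dim, dim, kernel_size=5, padding=2, groups=dim)
        else:
            # Global relation aggregator: attention over all T*H*W tokens.
            self.norm1 = nn.LayerNorm(dim)
            self.mhra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, H, W)
        x = x + self.dpe(x)
        if self.local:
            x = x + self.mhra(self.norm1(x))
        else:
            n, c, t, h, w = x.shape
            tokens = self.norm1(x.flatten(2).transpose(1, 2))  # (N, THW, C)
            attn, _ = self.mhra(tokens, tokens, tokens)
            x = x + attn.transpose(1, 2).reshape(n, c, t, h, w)
        # Token-wise channel MLP.
        n, c, t, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)
        tokens = tokens + self.ffn(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(n, c, t, h, w)
```

Stacking `local=True` blocks in the early stages and `local=False` blocks in the later ones mirrors the shallow-local/deep-global design; the real backbone (see the configs below) also downsamples between stages.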

## Results and Models

### Kinetics-400

| frame sampling strategy | resolution | backbone | top1 acc | top5 acc | [reference](https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md) top1 acc | [reference](https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md) top5 acc | testing protocol | FLOPs | params | config | ckpt |
| :---------------------: | :------------: | :---------: | :------: | :------: | :--------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------: | :--------------: | :---: | :----: | :----: | :--: |
| 16x4x1 | short-side 320 | UniFormer-S | 80.9 | 94.6 | 80.8 | 94.7 | 4 clips x 1 crop | 41.8G | 21.4M | [config](/configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-c630a037.pth) |
| 16x4x1 | short-side 320 | UniFormer-B | 82.0 | 95.0 | 82.0 | 95.1 | 4 clips x 1 crop | 96.7G | 49.8M | [config](/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-157c2e66.pth) |
| 32x4x1 | short-side 320 | UniFormer-B | 83.1 | 95.3 | 82.9 | 95.4 | 4 clips x 1 crop | 59G | 49.8M | [config](/configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb_20221219-b776322c.pth) |

The models are ported from the repo [UniFormer](https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md) and tested on our data. Currently, we only support testing of UniFormer models; training support will be available soon.

1. The values in the columns named "reference" are the results reported by the original repo.
2. The Kinetics-400 validation set we use consists of 19,787 videos. The videos are available at [Kinetics400](https://pan.baidu.com/s/1t5K0FRz3PGAT-37-3FwAfg) (BaiduYun password: g5kp).
3. Since the original Kinetics-400/600/700 models adopt a different [label file](https://drive.google.com/drive/folders/17VB-XdF3Kfr9ORmnGyXCxTMs86n0L4QL), we simply remap the classification weights according to the label names, as shown in the sketch after this list. The new label maps for Kinetics-400/600/700 can be found [here](https://github.com/open-mmlab/mmaction2/tree/dev-1.x/tools/data/kinetics).
4. Due to implementation differences between [SlowFast](https://github.com/facebookresearch/SlowFast) and MMAction2, there are small gaps between their performances.
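
A minimal sketch of the weight remapping in note 3, assuming the classifier rows of the original checkpoint follow the source label order; the file names and the `cls_head.fc_cls.*` key names are hypothetical and depend on the actual checkpoint:

```python
import torch

# Hypothetical file names; the real label maps live under tools/data/kinetics.
src_labels = [line.strip() for line in open('labels_original.txt')]
dst_labels = [line.strip() for line in open('labels_mmaction.txt')]

ckpt = torch.load('uniformer_original.pth', map_location='cpu')
state = ckpt.get('state_dict', ckpt)

# Reorder classifier rows so that row i scores dst_labels[i].
perm = torch.tensor([src_labels.index(name) for name in dst_labels])
for key in ('cls_head.fc_cls.weight', 'cls_head.fc_cls.bias'):
    if key in state:  # key names depend on the model definition
        state[key] = state[key][perm]

torch.save({'state_dict': state}, 'uniformer_remapped.pth')
```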

For more details on data preparation, you can refer to [preparing_kinetics](/tools/data/kinetics/README.md).
## Test

You can use the following command to test a model.

```shell
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [optional arguments]
```

Example: test the UniFormer-S model on the Kinetics-400 dataset and dump the result to a pkl file.

```shell
python tools/test.py configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py \
    checkpoints/SOME_CHECKPOINT.pth --dump result.pkl
```
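
If you want to inspect the dumped file afterwards, a minimal sketch is below; the per-sample structure depends on the MMAction2 version, so print one entry rather than trusting any particular key names:

```python
import pickle

with open('result.pkl', 'rb') as f:
    results = pickle.load(f)  # typically one serialized sample per video

print(len(results))
# Keys such as 'pred_score' and 'gt_label' are assumptions; inspect one
# entry to see what your version actually serializes.
print(results[0])
```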

For more details, you can refer to the **Test** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).

## Citation

```BibTeX
@inproceedings{li2022uniformer,
  title={UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning},
  author={Kunchang Li and Yali Wang and Gao Peng and Guanglu Song and Yu Liu and Hongsheng Li and Yu Qiao},
  booktitle={International Conference on Learning Representations},
  year={2022},
  url={https://openreview.net/forum?id=nBU_u6DLvoK}
}
```
70 changes: 70 additions & 0 deletions
configs/recognition/uniformer/metafile.yml
```yaml
Collections:
  - Name: UniFormer
    README: configs/recognition/uniformer/README.md
    Paper:
      URL: https://arxiv.org/abs/2201.04676
      Title: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning"

Models:
  - Name: uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb
    Config: configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py
    In Collection: UniFormer
    Metadata:
      Architecture: UniFormer-S
      Pretrained: ImageNet-1K
      Resolution: short-side 320
      Frame: 16
      Sampling rate: 4
      Modality: RGB
    Converted From:
      Weights: https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md
      Code: https://github.com/Sense-X/UniFormer/tree/main/video_classification
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 80.9
          Top 5 Accuracy: 94.6
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-c630a037.pth

  - Name: uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb
    Config: configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py
    In Collection: UniFormer
    Metadata:
      Architecture: UniFormer-B
      Pretrained: ImageNet-1K
      Resolution: short-side 320
      Frame: 16
      Sampling rate: 4
      Modality: RGB
    Converted From:
      Weights: https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md
      Code: https://github.com/Sense-X/UniFormer/tree/main/video_classification
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 82.0
          Top 5 Accuracy: 95.0
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb_20221219-157c2e66.pth

  - Name: uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb
    Config: configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py
    In Collection: UniFormer
    Metadata:
      Architecture: UniFormer-B
      Pretrained: ImageNet-1K
      Resolution: short-side 320
      Frame: 32
      Sampling rate: 4
      Modality: RGB
    Converted From:
      Weights: https://github.com/Sense-X/UniFormer/blob/main/video_classification/README.md
      Code: https://github.com/Sense-X/UniFormer/tree/main/video_classification
    Results:
      - Dataset: Kinetics-400
        Task: Action Recognition
        Metrics:
          Top 1 Accuracy: 83.1
          Top 5 Accuracy: 95.3
    Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/uniformerv1/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb_20221219-b776322c.pth
```
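
Tools such as `mim` consume this metafile, but you can also read it directly. A minimal sketch with PyYAML, assuming the conventional file location shown above:

```python
import yaml

# Path assumed from the repo layout above.
with open('configs/recognition/uniformer/metafile.yml') as f:
    meta = yaml.safe_load(f)

for model in meta['Models']:
    top1 = model['Results'][0]['Metrics']['Top 1 Accuracy']
    print(f"{model['Name']}: top-1 {top1}, weights: {model['Weights']}")
```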
58 changes: 58 additions & 0 deletions
configs/recognition/uniformer/uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py
```python
_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='UniFormer',
        depth=[5, 8, 20, 7],
        embed_dim=[64, 128, 320, 512],
        head_dim=64,
        drop_path_rate=0.3),
    cls_head=dict(
        type='I3DHead',
        dropout_ratio=0.,
        num_classes=400,
        in_channels=512,
        average_clips='prob'),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[114.75, 114.75, 114.75],
        std=[57.375, 57.375, 57.375],
        format_shape='NCTHW'))

# dataset settings
dataset_type = 'VideoDataset'
data_root_val = 'data/k400'
ann_file_test = 'data/k400/val.csv'

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=16,
        frame_interval=4,
        num_clips=4,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]

test_dataloader = dict(
    batch_size=32,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True,
        delimiter=','))

test_evaluator = dict(type='AccMetric')
test_cfg = dict(type='TestLoop')
```
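
As a hedged sketch of how such a config can be consumed programmatically with the MMEngine-based 1.x API (illustrative usage, not the project's prescribed entry point):

```python
import torch
from mmengine.config import Config

from mmaction.registry import MODELS
from mmaction.utils import register_all_modules

register_all_modules()  # registers UniFormer, I3DHead, etc. with the registry

cfg = Config.fromfile(
    'configs/recognition/uniformer/'
    'uniformer-base_imagenet1k-pre_16x4x1_kinetics400-rgb.py')
model = MODELS.build(cfg.model)

# A dummy clip shaped (N, C, T, H, W); the backbone alone returns features.
x = torch.randn(1, 3, 16, 224, 224)
with torch.no_grad():
    feats = model.backbone(x)
print(feats.shape)
```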
58 changes: 58 additions & 0 deletions
configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py
```python
_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='UniFormer',
        depth=[5, 8, 20, 7],
        embed_dim=[64, 128, 320, 512],
        head_dim=64,
        drop_path_rate=0.3),
    cls_head=dict(
        type='I3DHead',
        dropout_ratio=0.,
        num_classes=400,
        in_channels=512,
        average_clips='prob'),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[114.75, 114.75, 114.75],
        std=[57.375, 57.375, 57.375],
        format_shape='NCTHW'))

# dataset settings
dataset_type = 'VideoDataset'
data_root_val = 'data/k400'
ann_file_test = 'data/k400/val.csv'

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=32,
        frame_interval=4,
        num_clips=4,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]

test_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True,
        delimiter=','))

test_evaluator = dict(type='AccMetric')
test_cfg = dict(type='TestLoop')
```
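
To sanity-check the FLOPs column for this 32-frame variant, MMAction2 ships a FLOPs counter; the script path and flag below are assumptions based on the 1.x layout, so verify them in your checkout:

```shell
# Path and --shape flag assumed; see tools/analysis_tools in your checkout.
python tools/analysis_tools/get_flops.py \
    configs/recognition/uniformer/uniformer-base_imagenet1k-pre_32x4x1_kinetics400-rgb.py \
    --shape 1 3 32 224 224
```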
58 changes: 58 additions & 0 deletions
configs/recognition/uniformer/uniformer-small_imagenet1k-pre_16x4x1_kinetics400-rgb.py
```python
_base_ = ['../../_base_/default_runtime.py']

# model settings
model = dict(
    type='Recognizer3D',
    backbone=dict(
        type='UniFormer',
        depth=[3, 4, 8, 3],
        embed_dim=[64, 128, 320, 512],
        head_dim=64,
        drop_path_rate=0.1),
    cls_head=dict(
        type='I3DHead',
        dropout_ratio=0.,
        num_classes=400,
        in_channels=512,
        average_clips='prob'),
    data_preprocessor=dict(
        type='ActionDataPreprocessor',
        mean=[114.75, 114.75, 114.75],
        std=[57.375, 57.375, 57.375],
        format_shape='NCTHW'))

# dataset settings
dataset_type = 'VideoDataset'
data_root_val = 'data/k400'
ann_file_test = 'data/k400/val.csv'

test_pipeline = [
    dict(type='DecordInit'),
    dict(
        type='SampleFrames',
        clip_len=16,
        frame_interval=4,
        num_clips=4,
        test_mode=True),
    dict(type='DecordDecode'),
    dict(type='Resize', scale=(-1, 224)),
    dict(type='CenterCrop', crop_size=224),
    dict(type='FormatShape', input_format='NCTHW'),
    dict(type='PackActionInputs')
]

test_dataloader = dict(
    batch_size=32,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=dict(
        type=dataset_type,
        ann_file=ann_file_test,
        data_prefix=dict(video=data_root_val),
        pipeline=test_pipeline,
        test_mode=True,
        delimiter=','))

test_evaluator = dict(type='AccMetric')
test_cfg = dict(type='TestLoop')
```
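
All three configs read `data/k400/val.csv` with `delimiter=','`, so the annotation file is expected to contain one `relative/video/path,label` pair per line. A hypothetical excerpt (paths and label indices are made up):

```
abseiling/0wR5jVB-WPk_000417_000427.mp4,0
zumba/D5A9-pCWPyU_000114_000124.mp4,399
```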