Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Doc] Update TimeSformer models' README & metafile #2124

Merged
merged 6 commits into from
Dec 26, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@
# The testing is w/o. any cropping / flipping
val_pipeline = [
dict(
type='SampleAVAFrames', clip_len=16, frame_interval=4, test_mode=True),
type='SampleAVAFrames', clip_len=16, frame_interval=4, test_mode=True),
dict(type='RawFrameDecode', **file_client_args),
dict(type='Resize', scale=(-1, 256)),
dict(type='FormatShape', input_format='NCTHW', collapse=True),
Expand Down
20 changes: 9 additions & 11 deletions configs/recognition/timesformer/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,19 +20,17 @@ We present a convolution-free approach to video classification built exclusively

### Kinetics-400

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | inference_time(video/s) | gpu_mem(M) | config | ckpt | log |
| :---------------------: | :------------: | :--: | :---------------------: | :----------: | :------: | :------: | :---------------------: | :--------: | :------------------------: | :-----------------------: | :----------------------: |
| 8x32x1 | short-side 320 | 8 | TimeSformer (divST) | ImageNet-21K | 77.96 | 93.57 | x | 15235 | [config](/configs/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-a4d0d01f.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
| 8x32x1 | short-side 320 | 8 | TimeSformer (jointST) | ImageNet-21K | 76.93 | 93.27 | x | 33358 | [config](/configs/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-8022d1c0.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
| 8x32x1 | short-side 320 | 8 | TimeSformer (spaceOnly) | ImageNet-21K | 76.98 | 92.83 | x | 12355 | [config](/configs/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb_20220815-78f05367.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.log) |

1. The **gpus** indicates the number of gpu (80G A100) we used to get the checkpoint. It is noteworthy that the configs we provide are used for 8 gpus as default.
According to the [Linear Scaling Rule](https://arxiv.org/abs/1706.02677), you may set the learning rate proportional to the batch size if you use different GPUs or videos per GPU,
e.g., lr=0.005 for 8 GPUs x 8 videos/gpu and lr=0.00375 for 8 GPUs x 6 videos/gpu.
| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| :---------------------: | :--------: | :--: | :---------------------: | :----------: | :------: | :------: | :--------------: | :---: | :----: | :----------------------------: | :--------------------------: | :-------------------------: |
| 8x32x1 | 224x224 | 8 | TimeSformer (divST) | ImageNet-21K | 77.69 | 93.45 | 1 clip x 3 crop | 196G | 122M | [config](/configs/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-a4d0d01f.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the resolution setting changed?

| 8x32x1 | 224x224 | 8 | TimeSformer (jointST) | ImageNet-21K | 76.95 | 93.28 | 1 clip x 3 crop | 180G | 86.11M | [config](/configs/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-8022d1c0.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
| 8x32x1 | 224x224 | 8 | TimeSformer (spaceOnly) | ImageNet-21K | 76.93 | 92.88 | 1 clip x 3 crop | 141G | 86.11M | [config](/configs/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb_20220815-78f05367.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.log) |

1. The **gpus** indicates the number of gpus we used to get the checkpoint. If you want to use a different number of gpus or videos per gpu, the best way is to set `--auto-scale-lr` when calling `tools/train.py`, this parameter will auto-scale the learning rate according to the actual batch size and the original batch size.
2. We keep the test setting with the [original repo](https://github.com/facebookresearch/TimeSformer) (three crop x 1 clip).
3. The pretrained model `vit_base_patch16_224.pth` used by TimeSformer was converted from [vision_transformer](https://github.com/google-research/vision_transformer).

For more details on data preparation, you can refer to the **Prepare videos** part in the [Data Preparation Tutorial](/docs/en/user_guides/2_data_prepare.md).
For more details on data preparation, you can refer to [Kinetics400](/tools/data/kinetics/README.md).

## Train

Expand All @@ -46,7 +44,7 @@ Example: train TimeSformer model on Kinetics-400 dataset in a deterministic opti

```shell
python tools/train.py configs/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.py \
--cfg-options randomness.seed=0 randomness.deterministic=True
--seed=0 --deterministic
```

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).
Expand Down
24 changes: 15 additions & 9 deletions configs/recognition/timesformer/metafile.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,16 +14,18 @@ Models:
Batch Size: 8
Epochs: 15
Pretrained: ImageNet-21K
Resolution: short-side 320
Resolution: 224x224
FLOPs: 196G
params: 122M
Training Data: Kinetics-400
Training Resources: 8 GPUs
Modality: RGB
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 77.96
Top 5 Accuracy: 93.57
Top 1 Accuracy: 77.69
Top 5 Accuracy: 93.45
Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.log
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-a4d0d01f.pth

Expand All @@ -35,16 +37,18 @@ Models:
Batch Size: 8
Epochs: 15
Pretrained: ImageNet-21K
Resolution: short-side 320
Resolution: 224x224
FLOPs: 180G
params: 86.11M
Training Data: Kinetics-400
Training Resources: 8 GPUs
Modality: RGB
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 76.93
Top 5 Accuracy: 93.27
Top 1 Accuracy: 76.95
Top 5 Accuracy: 93.28
Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.log
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-8022d1c0.pth

Expand All @@ -56,15 +60,17 @@ Models:
Batch Size: 8
Epochs: 15
Pretrained: ImageNet-21K
Resolution: short-side 320
Resolution: 224x224
FLOPs: 141G
params: 86.11M
Training Data: Kinetics-400
Training Resources: 8 GPUs
Modality: RGB
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 76.98
Top 5 Accuracy: 92.83
Top 1 Accuracy: 76.93
Top 5 Accuracy: 92.88
Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.log
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb_20220815-78f05367.pth
Original file line number Diff line number Diff line change
Expand Up @@ -35,8 +35,10 @@
ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

file_client_args = dict(io_backend='disk')

train_pipeline = [
dict(type='DecordInit'),
dict(type='DecordInit', **file_client_args),
dict(type='SampleFrames', clip_len=8, frame_interval=32, num_clips=1),
dict(type='DecordDecode'),
dict(type='RandomRescale', scale_range=(256, 320)),
Expand All @@ -46,7 +48,7 @@
dict(type='PackActionInputs')
]
val_pipeline = [
dict(type='DecordInit'),
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=8,
Expand All @@ -60,7 +62,7 @@
dict(type='PackActionInputs')
]
test_pipeline = [
dict(type='DecordInit'),
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=8,
Expand Down Expand Up @@ -136,3 +138,9 @@
]

default_hooks = dict(checkpoint=dict(interval=5))

# Default setting for scaling LR automatically
# - `enable` means enable scaling LR automatically
# or not by default.
# - `base_batch_size` = (8 GPUs) x (8 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=64)