Skip to content

Commit

Permalink
[Doc] Update TimeSformer models' README & metafile (#2124)
Browse files Browse the repository at this point in the history
  • Loading branch information
hukkai authored Dec 26, 2022
1 parent cebdb56 commit c62f9c2
Show file tree
Hide file tree
Showing 3 changed files with 35 additions and 23 deletions.
20 changes: 9 additions & 11 deletions configs/recognition/timesformer/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,19 +20,17 @@ We present a convolution-free approach to video classification built exclusively

### Kinetics-400

| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | inference_time(video/s) | gpu_mem(M) | config | ckpt | log |
| :---------------------: | :------------: | :--: | :---------------------: | :----------: | :------: | :------: | :---------------------: | :--------: | :------------------------: | :-----------------------: | :----------------------: |
| 8x32x1 | short-side 320 | 8 | TimeSformer (divST) | ImageNet-21K | 77.96 | 93.57 | x | 15235 | [config](/configs/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-a4d0d01f.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
| 8x32x1 | short-side 320 | 8 | TimeSformer (jointST) | ImageNet-21K | 76.93 | 93.27 | x | 33358 | [config](/configs/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-8022d1c0.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
| 8x32x1 | short-side 320 | 8 | TimeSformer (spaceOnly) | ImageNet-21K | 76.98 | 92.83 | x | 12355 | [config](/configs/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb_20220815-78f05367.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.log) |

1. The **gpus** indicates the number of gpu (80G A100) we used to get the checkpoint. It is noteworthy that the configs we provide are used for 8 gpus as default.
According to the [Linear Scaling Rule](https://arxiv.org/abs/1706.02677), you may set the learning rate proportional to the batch size if you use different GPUs or videos per GPU,
e.g., lr=0.005 for 8 GPUs x 8 videos/gpu and lr=0.00375 for 8 GPUs x 6 videos/gpu.
| frame sampling strategy | resolution | gpus | backbone | pretrain | top1 acc | top5 acc | testing protocol | FLOPs | params | config | ckpt | log |
| :---------------------: | :--------: | :--: | :---------------------: | :----------: | :------: | :------: | :--------------: | :---: | :----: | :----------------------------: | :--------------------------: | :-------------------------: |
| 8x32x1 | 224x224 | 8 | TimeSformer (divST) | ImageNet-21K | 77.69 | 93.45 | 1 clip x 3 crop | 196G | 122M | [config](/configs/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-a4d0d01f.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
| 8x32x1 | 224x224 | 8 | TimeSformer (jointST) | ImageNet-21K | 76.95 | 93.28 | 1 clip x 3 crop | 180G | 86.11M | [config](/configs/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-8022d1c0.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.log) |
| 8x32x1 | 224x224 | 8 | TimeSformer (spaceOnly) | ImageNet-21K | 76.93 | 92.88 | 1 clip x 3 crop | 141G | 86.11M | [config](/configs/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.py) | [ckpt](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb_20220815-78f05367.pth) | [log](https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.log) |

1. The **gpus** indicates the number of gpus we used to get the checkpoint. If you want to use a different number of gpus or videos per gpu, the best way is to set `--auto-scale-lr` when calling `tools/train.py`, this parameter will auto-scale the learning rate according to the actual batch size and the original batch size.
2. We keep the test setting with the [original repo](https://github.com/facebookresearch/TimeSformer) (three crop x 1 clip).
3. The pretrained model `vit_base_patch16_224.pth` used by TimeSformer was converted from [vision_transformer](https://github.com/google-research/vision_transformer).

For more details on data preparation, you can refer to the **Prepare videos** part in the [Data Preparation Tutorial](/docs/en/user_guides/2_data_prepare.md).
For more details on data preparation, you can refer to [Kinetics400](/tools/data/kinetics/README.md).

## Train

Expand All @@ -46,7 +44,7 @@ Example: train TimeSformer model on Kinetics-400 dataset in a deterministic opti

```shell
python tools/train.py configs/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.py \
--cfg-options randomness.seed=0 randomness.deterministic=True
--seed=0 --deterministic
```

For more details, you can refer to the **Training** part in the [Training and Test Tutorial](/docs/en/user_guides/4_train_test.md).
Expand Down
24 changes: 15 additions & 9 deletions configs/recognition/timesformer/metafile.yml
Original file line number Diff line number Diff line change
Expand Up @@ -14,16 +14,18 @@ Models:
Batch Size: 8
Epochs: 15
Pretrained: ImageNet-21K
Resolution: short-side 320
Resolution: 224x224
FLOPs: 196G
params: 122M
Training Data: Kinetics-400
Training Resources: 8 GPUs
Modality: RGB
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 77.96
Top 5 Accuracy: 93.57
Top 1 Accuracy: 77.69
Top 5 Accuracy: 93.45
Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb.log
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_divST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-a4d0d01f.pth

Expand All @@ -35,16 +37,18 @@ Models:
Batch Size: 8
Epochs: 15
Pretrained: ImageNet-21K
Resolution: short-side 320
Resolution: 224x224
FLOPs: 180G
params: 86.11M
Training Data: Kinetics-400
Training Resources: 8 GPUs
Modality: RGB
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 76.93
Top 5 Accuracy: 93.27
Top 1 Accuracy: 76.95
Top 5 Accuracy: 93.28
Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb.log
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_jointST_8xb8-8x32x1-15e_kinetics400-rgb_20220815-8022d1c0.pth

Expand All @@ -56,15 +60,17 @@ Models:
Batch Size: 8
Epochs: 15
Pretrained: ImageNet-21K
Resolution: short-side 320
Resolution: 224x224
FLOPs: 141G
params: 86.11M
Training Data: Kinetics-400
Training Resources: 8 GPUs
Modality: RGB
Results:
- Dataset: Kinetics-400
Task: Action Recognition
Metrics:
Top 1 Accuracy: 76.98
Top 5 Accuracy: 92.83
Top 1 Accuracy: 76.93
Top 5 Accuracy: 92.88
Training Log: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb.log
Weights: https://download.openmmlab.com/mmaction/v1.0/recognition/timesformer/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb/timesformer_spaceOnly_8xb8-8x32x1-15e_kinetics400-rgb_20220815-78f05367.pth
Original file line number Diff line number Diff line change
Expand Up @@ -35,8 +35,10 @@
ann_file_val = 'data/kinetics400/kinetics400_val_list_videos.txt'
ann_file_test = 'data/kinetics400/kinetics400_val_list_videos.txt'

file_client_args = dict(io_backend='disk')

train_pipeline = [
dict(type='DecordInit'),
dict(type='DecordInit', **file_client_args),
dict(type='SampleFrames', clip_len=8, frame_interval=32, num_clips=1),
dict(type='DecordDecode'),
dict(type='RandomRescale', scale_range=(256, 320)),
Expand All @@ -46,7 +48,7 @@
dict(type='PackActionInputs')
]
val_pipeline = [
dict(type='DecordInit'),
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=8,
Expand All @@ -60,7 +62,7 @@
dict(type='PackActionInputs')
]
test_pipeline = [
dict(type='DecordInit'),
dict(type='DecordInit', **file_client_args),
dict(
type='SampleFrames',
clip_len=8,
Expand Down Expand Up @@ -136,3 +138,9 @@
]

default_hooks = dict(checkpoint=dict(interval=5))

# Default setting for scaling LR automatically
# - `enable` means enable scaling LR automatically
# or not by default.
# - `base_batch_size` = (8 GPUs) x (8 samples per GPU).
auto_scale_lr = dict(enable=False, base_batch_size=64)

0 comments on commit c62f9c2

Please sign in to comment.