This repository contains the official PyTorch implementation of the following paper:
Multi-modality Associative Bridging through Memory: Speech Sound Recollected from Face
Minsu Kim*, Joanna Hong*, Sejin Park, and Yong Man Ro (*Equal contribution)
Paper: https://openaccess.thecvf.com/content/ICCV2021/papers/Kim_Multi-Modality_Associative_Bridging_Through_Memory_Speech_Sound_Recollected_From_Face_ICCV_2021_paper.pdf
- python 3.7
- pytorch 1.6 ~ 1.9
- torchvision
- torchaudio
- av
- tensorboard
- pillow
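If it helps, the dependencies listed above can be installed with pip; the exact versions (Python 3.7, PyTorch 1.6 ~ 1.9) should be matched to your CUDA setup, so treat this as a starting point rather than the repository's official install command.

```shell
# One possible way to install the dependencies listed above (pin versions to match the requirements)
pip install torch torchvision torchaudio av tensorboard pillow
```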
The LRW dataset can be downloaded from the link below.
Pre-processing is performed in the data loader: each video is cropped with the bounding box [x1:59, y1:95, x2:195, y2:231].
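As a rough illustration of that crop (the actual pre-processing lives in the repository's data loader; the helper name below is made up):

```python
from PIL import Image

def crop_lrw_frame(frame: Image.Image) -> Image.Image:
    # Crop with the bounding box (x1, y1, x2, y2) = (59, 95, 195, 231),
    # i.e. a 136 x 136 pixel region of each LRW frame.
    return frame.crop((59, 95, 195, 231))
```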
main.py saves the model weights in --checkpoint_dir and writes the training logs to ./runs.
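For example, the logs under ./runs can be viewed with TensorBoard (already listed in the requirements):

```shell
tensorboard --logdir ./runs
```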
To train the model, run the following command:
```shell
# Distributed training example for LRW
python -m torch.distributed.launch --nproc_per_node='number of gpus' main.py \
--lrw 'enter_data_path' \
--checkpoint_dir 'enter_the_path_for_save' \
--batch_size 80 --epochs 200 \
--mode train --radius 16 --n_slot 88 \
--augmentations --distributed \
--gpu 0,1...
```
```shell
# Data Parallel training example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint_dir 'enter_the_path_for_save' \
--batch_size 320 --epochs 200 \
--mode train --radius 16 --n_slot 88 \
--augmentations --dataparallel \
--gpu 0,1...
```
Descriptions of training parameters are as follows:
- --lrw: training dataset location (LRW)
- --checkpoint_dir: directory for saving checkpoints
- --batch_size: batch size
- --epochs: number of epochs
- --mode: train / val / test
- --augmentations: whether to perform data augmentation
- --distributed: use DistributedDataParallel
- --dataparallel: use DataParallel
- --gpu: GPUs to use
- --lr: learning rate
- --n_slot: memory slot size (number of slots)
- --radius: scaling factor for the addressing score (see the illustrative sketch after this list)
- Refer to main.py for the other training parameters
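For intuition only, here is a small, self-contained sketch of how a memory with --n_slot slots and an addressing score scaled by --radius could look; it is not the repository's implementation, and the shapes and similarity function are assumptions.

```python
import torch
import torch.nn.functional as F

n_slot, dim, radius = 88, 512, 16       # n_slot / radius mirror the LRW command above; dim is arbitrary
memory = torch.randn(n_slot, dim)       # stand-in for the learnable memory slots
query = torch.randn(4, dim)             # stand-in for a batch of query features

# Addressing: scaled cosine similarity between queries and slots, then softmax over slots.
q = F.normalize(query, dim=-1)
m = F.normalize(memory, dim=-1)
scores = radius * (q @ m.t())           # (4, n_slot) similarity scores scaled by the radius
address = F.softmax(scores, dim=-1)     # (4, n_slot) addressing weights
recalled = address @ memory             # (4, dim) representation recalled from memory
```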
To test the model, run the following command:
```shell
# Testing example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint 'enter_the_checkpoint_path' \
--batch_size 80 \
--mode test --radius 16 --n_slot 88 \
--test_aug \
--gpu 0
```
Descriptions of testing parameters are as follows:
- --lrw: dataset location (LRW)
- --checkpoint: path to the checkpoint file
- --batch_size: batch size
- --mode: train / val / test
- --test_aug: whether to perform test-time augmentation
- --distributed: use DistributedDataParallel
- --dataparallel: use DataParallel
- --gpu: GPUs to use
- --lr: learning rate
- --n_slot: memory slot size (number of slots)
- --radius: scaling factor for the addressing score
- Refer to main.py for the other testing parameters
You can download the pretrained models.
Put the .ckpt files in ./data/ (see the example below).
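For example (the download location below is just a placeholder):

```shell
mkdir -p ./data
mv ~/Downloads/GRU_Back_Ckpt.ckpt ~/Downloads/MSTCN_Back_Ckpt.ckpt ./data/
```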
Bi-GRU Backend
To test the pretrained model, run the following command:
```shell
# Testing example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint ./data/GRU_Back_Ckpt.ckpt \
--batch_size 80 --backend GRU \
--mode test --radius 16 --n_slot 88 \
--test_aug True --distributed False --dataparallel False \
--gpu 0
```
MS-TCN Backend
To test the pretrained model, run the following command:
```shell
# Testing example for LRW
python main.py \
--lrw 'enter_data_path' \
--checkpoint ./data/MSTCN_Back_Ckpt.ckpt \
--batch_size 80 --backend MSTCN \
--mode test --radius 16 --n_slot 168 \
--test_aug True --distributed False --dataparallel False \
--gpu 0
```
Architecture | Acc. (%) |
---|---|
Resnet18 + MS-TCN + Multi-modal Mem | 85.864 |
Resnet18 + Bi-GRU + Multi-modal Mem | 85.408 |
You can also use the pretrained model to perform Audio-Visual Speech Recognition (AVSR), since it is trained with both audio and video inputs.
To run AVSR, use the fused prediction 'tr_fusion' (refer to the training code) as the output.
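A conceptual, runnable sketch of that fused prediction is below; the fused output is called 'tr_fusion' in the training code, but the random logits and the additive fusion here are placeholders, not the repository's actual model.

```python
import torch

num_classes = 500                            # LRW contains 500 word classes
video_logits = torch.randn(8, num_classes)   # stand-in for the visual branch output
audio_logits = torch.randn(8, num_classes)   # stand-in for the audio branch output

tr_fusion = video_logits + audio_logits      # placeholder fusion; see the train code for the real tr_fusion
prediction = tr_fusion.argmax(dim=-1)        # predicted word index per sample
```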
If you find this work useful in your research, please cite the paper:
```
@inproceedings{kim2021multimodalmem,
  title={Multi-Modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video},
  author={Kim, Minsu and Hong, Joanna and Park, Se Jin and Ro, Yong Man},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={296--306},
  year={2021}
}

@article{kim2021cromm,
  title={Cromm-vsr: Cross-modal memory augmented visual speech recognition},
  author={Kim, Minsu and Hong, Joanna and Park, Se Jin and Ro, Yong Man},
  journal={IEEE Transactions on Multimedia},
  year={2021},
  publisher={IEEE}
}
```