[Link to Paper - Coming Soon! 📄]
[Link to Thesis - Coming Soon! 📚]
SynthAVSR is a framework for Audiovisual Speech Recognition (AVSR) that leverages synthetic data to address the scarcity of audiovisual training data. Building upon AV-HuBERT, a self-supervised audiovisual representation learning framework, the project focuses on Spanish 🇪🇸 and Catalan. It introduces a novel approach to generating synthetic audiovisual data for training, with the goal of achieving state-of-the-art performance in lip-reading, ASR, and audiovisual speech recognition. 🌟
If you find SynthAVSR useful for your research, please cite our upcoming publication (details to be added here soon).
```bibtex
@article{buitrago2024synthavsr,
  author  = {Pol Buitrago},
  title   = {SynthAVSR: Leveraging Synthetic Data for Advancing Audiovisual Speech Recognition},
  journal = {arXiv preprint (coming soon)},
  year    = {2024}
}
```
Checkpoints and models adapted for our project are available in the table below:
| Modality    | MixAVSR  | RealAVSR | SynthAVSRGAN | CAT-AVSR |
|-------------|----------|----------|--------------|----------|
| Audiovisual | Download | Download | Download     | Download |
| Audio-Only  | Download | Download | Download     | Download |
| Visual-Only | Download | Download | Download     | Download |
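Once downloaded, a checkpoint can be inspected or reused outside the training scripts through fairseq's checkpoint utilities. The snippet below is a minimal sketch, assuming the `avhubert` directory of this repository is available to register the custom task/model with fairseq; the checkpoint path is a placeholder for whichever file you downloaded from the table above.

```python
# Minimal sketch: load a downloaded SynthAVSR / AV-HuBERT-style checkpoint with fairseq.
# The user_dir and checkpoint paths below are placeholders, not files shipped with this README.
from argparse import Namespace

from fairseq import checkpoint_utils, utils

# Register the avhubert task/model definitions (same role as common.user_dir=`pwd` in the commands below).
utils.import_user_module(Namespace(user_dir="avhubert"))

ckpt_path = "checkpoints/synthavsr_audiovisual.pt"  # placeholder: the file downloaded from the table

# arg_overrides= can be passed to repoint task.data / task.label_dir at local copies if loading complains.
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
model = models[0].eval()
print(type(model).__name__)
```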
Results on the Spanish evaluation sets:

| Model        | LIP-RTVE | CMU-MOSEASES | MuAViCES |
|--------------|----------|--------------|----------|
| MixAVSR      | 8.2%     | 14.2%        | 15.7%    |
| RealAVSR     | 9.3%     | 15.4%        | 16.6%    |
| SynthAVSRGAN | 21.1%    | 35.2%        | 39.6%    |
Results on the Catalan benchmark:

| Model    | AVCAT-Benchmark |
|----------|-----------------|
| CAT-AVSR | 25%             |
To get started with SynthAVSR, clone the repository and set up a Conda environment from the provided `SynthAVSR.yml` file:

- Clone the repository and initialize its submodules:

  ```sh
  git clone https://github.com/Pol-Buitrago/SynthAVSR.git
  cd SynthAVSR
  git submodule init
  git submodule update
  ```

- Create and activate the environment (an optional sanity check follows this list):

  ```sh
  conda env create -f SynthAVSR.yml
  conda activate synth_avsr
  ```
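As an optional sanity check, not part of the official setup, you can confirm that the core dependencies resolve inside the freshly created environment:

```python
# Optional sanity check: run inside the activated synth_avsr environment.
import torch
import fairseq

print("torch:", torch.__version__)
print("fairseq:", fairseq.__version__)
print("CUDA available:", torch.cuda.is_available())
```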
Follow the steps in `preparation` to pre-process the LRS3 and VoxCeleb2 datasets. For any other dataset, follow an analogous procedure.
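In the AV-HuBERT-style pipeline this repository follows, preparation typically ends with `{train,valid,test}.tsv` manifests that index the processed audio/video clips. The sketch below only illustrates that idea; the column layout (root directory on the first line, then one utterance per line) is an assumption borrowed from upstream AV-HuBERT, so verify it against what the preparation scripts actually emit.

```python
# Hypothetical helper illustrating an AV-HuBERT-style .tsv manifest.
# The column layout (id, relative video path, relative audio path, #video frames, #audio samples)
# is an assumption; check it against the files produced by the preparation scripts.
def write_manifest(out_path, root, utterances):
    """utterances: iterable of (utt_id, video_rel, audio_rel, n_video_frames, n_audio_samples)."""
    with open(out_path, "w") as f:
        f.write(f"{root}\n")  # first line: root directory that the relative paths hang from
        for utt_id, video_rel, audio_rel, n_vid, n_aud in utterances:
            f.write(f"{utt_id}\t{video_rel}\t{audio_rel}\t{n_vid}\t{n_aud}\n")

write_manifest(
    "data/train.tsv",
    root="/path/to/processed/lrs3",
    utterances=[("spk1/utt001", "video/spk1/utt001.mp4", "audio/spk1/utt001.wav", 75, 48000)],
)
```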
Follow the steps in `clustering` (for pre-training only) to create the `{train,valid}.km` frame-aligned pseudo-label files. The `label_rate` is the same as the feature frame rate used for clustering, which is 100 Hz for MFCC features and 25 Hz for AV-HuBERT features by default.
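Conceptually, the `.km` files contain frame-level k-means cluster assignments, one space-separated sequence per utterance. The sketch below illustrates the idea on 100 Hz MFCC features with librosa and scikit-learn; it is not the project's clustering script, and all paths and the cluster count are placeholders.

```python
# Illustrative sketch of frame-aligned pseudo-label creation (not the repo's clustering script).
# Paths and the number of clusters are placeholders.
import librosa
import numpy as np
from sklearn.cluster import MiniBatchKMeans

SR = 16000
HOP = SR // 100  # 10 ms hop -> 100 Hz frame rate, matching label_rate=100 for MFCC features

def mfcc_frames(wav_path):
    wav, _ = librosa.load(wav_path, sr=SR)
    # (n_frames, 13) MFCC matrix at 100 frames per second
    return librosa.feature.mfcc(y=wav, sr=SR, n_mfcc=13, hop_length=HOP).T

train_wavs = ["data/audio/utt001.wav", "data/audio/utt002.wav"]  # placeholder list

# 1) Fit k-means on features pooled over the training audio.
feats = np.concatenate([mfcc_frames(p) for p in train_wavs], axis=0)
km = MiniBatchKMeans(n_clusters=100, batch_size=10000).fit(feats)

# 2) Write one line per utterance: space-separated cluster ids, one id per 10 ms frame.
with open("labels/train.km", "w") as f:
    for p in train_wavs:
        ids = km.predict(mfcc_frames(p))
        f.write(" ".join(map(str, ids)) + "\n")
```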
To train a model, run the following command, adjusting paths as necessary:

```sh
$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
    task.data=/path/to/data task.label_dir=/path/to/label \
    model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
    common.user_dir=`pwd`
```
To fine-tune a pre-trained HuBERT model at `/path/to/checkpoint`, run:

```sh
$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
    task.data=/path/to/data task.label_dir=/path/to/label \
    task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
    hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`
```
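The `task.tokenizer_bpe_model` argument points to a trained SentencePiece model. If you need to build one for your own transcripts, the sketch below shows the general idea; the input file, vocabulary size, and model prefix are placeholders rather than this project's actual tokenizer recipe.

```python
# Sketch: train a SentencePiece tokenizer for fine-tuning.
# "train.txt" (one transcript per line), the vocab size and the prefix are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="spm_unigram_1000",  # produces spm_unigram_1000.model / .vocab
    vocab_size=1000,
    model_type="unigram",
)
# Then pass task.tokenizer_bpe_model=/path/to/spm_unigram_1000.model to fairseq-hydra-train.
```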
To decode a fine-tuned model, run:

```sh
$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
    dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
    common_eval.results_path=/path/to/experiment/decode/s2s/test \
    override.modalities=['audio','video'] common.user_dir=`pwd`
```
Parameters like `generation.beam` and `generation.lenpen` can be adjusted to fine-tune the decoding process.
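For example, appending `generation.beam=20 generation.lenpen=1.0` to the decoding command above widens the beam search and keeps length normalization neutral; these values are illustrative, not tuned settings from this project.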
This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0). You can freely share, modify, and distribute the code, but it cannot be used for commercial purposes.
See the full license text.