Project for COSI 136a ASR
Command line: ffmpeg, sox/soxi
Python: Using Python 3.9+, install the dependencies:
pip install -r requirements.txt
Resample all mp3 files in a directory to wav files:
./resample.sh <DIR>
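For readers who want to see roughly what such a conversion involves, here is a minimal Python sketch that builds one ffmpeg command per mp3 file. It is not the contents of resample.sh: the function names, the 16 kHz mono target (the rate Whisper models expect) and the resampled/ output directory are all assumptions made for illustration.

```python
from pathlib import Path


def build_ffmpeg_command(src, dst, sample_rate=16000):
    """Return an ffmpeg argument list that converts src to a mono WAV file.

    The 16 kHz default is an assumption, not a value taken from resample.sh.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", "1",               # downmix to a single channel
        str(dst),
    ]


def plan_resampling(indir, outdir="resampled"):
    """Build one command per .mp3 file in indir (hypothetical helper)."""
    indir, outdir = Path(indir), Path(outdir)
    return [
        build_ffmpeg_command(mp3, outdir / (mp3.stem + ".wav"))
        for mp3 in sorted(indir.glob("*.mp3"))
    ]
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the output directory exists.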
Check the total length of the resampled files:
soxi resampled/*.wav | tail -n1
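If sox is not available, the same total can be approximated with the Python standard library. This is a sketch only: the function names are hypothetical, and the wave module reads uncompressed PCM WAV files only.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Duration of one PCM WAV file: frame count divided by frame rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def total_duration_seconds(directory):
    """Sum the durations of all .wav files directly inside directory."""
    return sum(
        wav_duration_seconds(path)
        for path in sorted(Path(directory).glob("*.wav"))
    )
```

For example, total_duration_seconds("resampled") / 3600 gives the corpus size in hours.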
Split directory of parallel .TextGrid and .wav files into short segments to use in a model:
usage: split_corpus.py [-h] [--max-seconds MAX_SECONDS] indir outdir

positional arguments:
  indir                 Directory of parallel .TextGrid and .wav files to load
  outdir                Directory to write segmented parallel .txt and .wav files

options:
  -h, --help            show this help message and exit
  --max-seconds MAX_SECONDS
                        Maximum duration in seconds of segmented audio files
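One way to picture the segmentation step is as a greedy grouping of consecutive TextGrid intervals, closing a segment before it would exceed the maximum duration. The sketch below illustrates that idea and is not necessarily how split_corpus.py works. It assumes the intervals have already been parsed into (start, end, text) tuples sorted by start time, and group_intervals is a hypothetical name.

```python
def group_intervals(intervals, max_seconds):
    """Greedily merge consecutive intervals into segments of bounded length.

    intervals: list of (start, end, text) tuples sorted by start time.
    Returns a list of (start, end, text) segments. An interval that is
    longer than max_seconds on its own is kept as a single segment.
    """
    groups = []
    current = []
    for start, end, text in intervals:
        if not text.strip():
            continue  # skip silent or unlabeled intervals
        # Close the current group if adding this interval would exceed the cap.
        if current and end - current[0][0] > max_seconds:
            groups.append(current)
            current = []
        current.append((start, end, text.strip()))
    if current:
        groups.append(current)
    return [
        (group[0][0], group[-1][1], " ".join(t for _, _, t in group))
        for group in groups
    ]
```

Each resulting segment gives the time span to cut from the .wav file and the text to write to the matching .txt file.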
Calculate statistics on the train, dev, and test splits (type/token counts, OOV rate):
usage: corpus_stats.py [-h] [--train TRAIN] [--dev DEV] [--test TEST]

options:
  -h, --help     show this help message and exit
  --train TRAIN  Directory for train partition containing .txt files
  --dev DEV      Directory for dev partition containing .txt files
  --test TEST    Directory for test partition containing .txt files
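As a rough illustration of these statistics, the sketch below counts types and tokens and computes an OOV rate against the training vocabulary. It is not the code in corpus_stats.py. Whitespace tokenization and a token-level OOV rate (rather than type-level) are assumptions, and the function names are hypothetical.

```python
from pathlib import Path


def read_tokens(directory):
    """Collect whitespace-separated tokens from every .txt file in directory."""
    tokens = []
    for path in sorted(Path(directory).glob("*.txt")):
        tokens.extend(path.read_text(encoding="utf-8").split())
    return tokens


def type_token_counts(tokens):
    """Return (number of distinct types, number of tokens)."""
    return len(set(tokens)), len(tokens)


def oov_rate(train_tokens, eval_tokens):
    """Fraction of evaluation tokens that never occur in the training data."""
    if not eval_tokens:
        return 0.0
    vocabulary = set(train_tokens)
    unseen = sum(1 for token in eval_tokens if token not in vocabulary)
    return unseen / len(eval_tokens)
```

For example, oov_rate(read_tokens("train"), read_tokens("dev")) gives the dev-set OOV rate, where the directory names are placeholders for the paths passed to --train and --dev.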
Follow the instructions in train.ipynb to fine-tune a pre-trained Whisper model on the newly created data and evaluate the results.
Note: this has only been tested in Google Colab using a T4 GPU, so it may not run on other platforms or architectures, including CPU.