PyTorch implementation of the use case model described in the paper "A Large and Clean Multilingual Corpus of Sentence Aligned Spoken Utterances Extracted from the Bible" (accepted at LREC 2020).
- pytorch
- librosa
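For reference, a minimal requirements file covering these two dependencies might look like the sketch below (package names taken from PyPI; the repository may pin specific versions):

```
torch
librosa
```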
1) Download the data (or build the corpus yourself using the following scripts)
You will need to download the pre-computed mel-spectrograms of the data set (such as used in the paper's experiments) here. These mel-spectrograms were computed with extract_spectrogram.py.
Build the train/val/test splits with build_splits.py. This script takes as input a CSV file summarizing which verses are available for which language. This CSV file can be computed with the following script. If you downloaded the pre-computed mel-spectrograms, this file was packed with them and is available here. You may use make_data.sh to build the splits for English-X language pairs.
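The CSV's exact layout is not described here; as an illustration of the split-building idea, assuming a hypothetical CSV with a verse_id column plus one boolean column per language, the verses usable for a given language pair are those available on both sides:

```python
import pandas as pd


def common_verses(csv_path, lang_a, lang_b):
    """Return the verse ids available in both languages of a pair.

    Assumes a hypothetical CSV layout: a 'verse_id' column plus one
    boolean availability column per language (the real file produced
    by the repository's scripts may be organized differently).
    """
    df = pd.read_csv(csv_path)
    # Keep only verses present in both languages of the pair
    mask = df[lang_a] & df[lang_b]
    return df.loc[mask, "verse_id"].tolist()
```

These aligned verse ids could then be partitioned into train/val/test subsets.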
Train a model using run_bible.py and evaluate it with test_bible.py.
python run.py train.json --data-val val.json