This is a PyTorch implementation of Listen-Attend-Spell (LAS), a sequence-to-sequence model for automatic speech recognition. A few salient features of this repository are:
- Custom implementation of energy-based attention using batched matrix multiplications (`torch.bmm`); a sketch appears after this list.
- Custom implementation of a recurrent network architecture in the decoder.
- Custom implementation of a pyramidal LSTM (not available in the PyTorch library); see the sketch after this list.
- Reaches a WER of 8.0, better than that reported in the original paper.
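Below is a minimal sketch of energy (dot-product) attention computed with batched matrix multiplications; the function name, tensor shapes, and masking details are illustrative assumptions rather than the exact code in this repository.

```python
import torch
import torch.nn.functional as F

def energy_attention(query, keys, values, lengths):
    """Dot-product (energy) attention with torch.bmm.

    query:   (B, 1, d)   decoder state projected into the key space
    keys:    (B, T, d)   encoder outputs projected into the key space
    values:  (B, T, dv)  encoder outputs projected into the value space
    lengths: (B,)        true utterance lengths, used to mask padding
    """
    energy = torch.bmm(query, keys.transpose(1, 2))               # (B, 1, T)
    mask = torch.arange(keys.size(1), device=keys.device)[None, :] >= lengths[:, None]
    energy = energy.masked_fill(mask.unsqueeze(1), float("-inf")) # ignore padded frames
    attention = F.softmax(energy, dim=-1)                         # (B, 1, T)
    context = torch.bmm(attention, values)                        # (B, 1, dv)
    return context, attention
```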
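Similarly, here is a sketch of a single pyramidal BiLSTM layer that halves the time resolution by concatenating adjacent frames before a bidirectional LSTM, as in the LAS paper; the class name and hyperparameters are assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class PyramidalBLSTM(nn.Module):
    """One pyramidal BiLSTM layer: concatenates adjacent frame pairs
    (halving the sequence length) and runs a bidirectional LSTM."""

    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.blstm = nn.LSTM(input_dim * 2, hidden_dim,
                             bidirectional=True, batch_first=True)

    def forward(self, x):                    # x: (B, T, input_dim)
        B, T, D = x.shape
        if T % 2:                            # drop the last frame if T is odd
            x = x[:, :-1, :]
            T -= 1
        x = x.reshape(B, T // 2, D * 2)      # stack adjacent frames
        out, _ = self.blstm(x)               # (B, T // 2, 2 * hidden_dim)
        return out

# Example: out = PyramidalBLSTM(40, 256)(torch.randn(8, 500, 40))
```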
- Use conda to create the project environment:
conda env create -f environment.yaml
conda activate LAS
bash install_libraries.bash
This will create and activate the `LAS` conda environment and install the required libraries.
We use the LibriSpeech dataset with a few pre-processing steps. In particular, we convert the audio waveforms into mel spectrograms, as recommended by the paper. For convenience, you can download the pre-processed dataset from my Google Drive with
bash load_data.bash
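If you would rather generate the features yourself, here is a minimal sketch of the waveform-to-mel-spectrogram conversion using `torchaudio`; the window, hop, and mel-bin settings are illustrative assumptions, not necessarily those used to build the provided dataset.

```python
import torchaudio

# Load one LibriSpeech utterance (16 kHz mono audio).
waveform, sample_rate = torchaudio.load("sample.flac")

# Convert the raw waveform into a log-mel spectrogram.
# n_fft / hop_length / n_mels below are assumed values for illustration.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=400,        # 25 ms window at 16 kHz
    hop_length=160,   # 10 ms hop at 16 kHz
    n_mels=40,
)
mel = mel_transform(waveform)        # (channels, n_mels, time)
log_mel = (mel + 1e-6).log()         # log compression
print(log_mel.shape)
```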
The following set of hyperparameters should give you the best performance on this dataset:
python train.py --lr 3e-4 -wd 1e-6 -bs 64 -e 100 -nl 4 -nld 3 -ed 256 -dd 256 -ebd 128 -kvs 256 -sim 0 -dp LAS-Dataset/complete -rp runs/ -w 1 -wu 0 -drp 0.3
In general, a few observations regarding the hyperparameters:
- The larger the embedding, key, and value dimensions, the faster attention converges, but the more memory-hungry the model becomes.
- Keep the teacher forcing rate at 100% at least until attention converges; if it falls below 60%, attention begins to diverge. In general, keep the teacher forcing rate around 80% and only start decreasing it after around 10 epochs (see the schedule sketch after this list).
- Removing weight tying, increasing decoder depth, or increasing encoder depth did not noticeably affect final performance.
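As a concrete illustration of the teacher-forcing advice above, here is a hypothetical schedule; the function and its default values are assumptions for illustration, not what `train.py` implements.

```python
def teacher_forcing_rate(epoch, warmup_epochs=10, floor=0.8, decay=0.02):
    """Hold 100% teacher forcing until attention has had time to converge
    (~10 epochs), then decay slowly toward the ~80% floor, staying well
    above the ~60% level where attention starts to diverge.
    All numbers here are illustrative assumptions."""
    if epoch < warmup_epochs:
        return 1.0
    return max(floor, 1.0 - decay * (epoch - warmup_epochs))
```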
If your hyperparameters are well chosen, attention should converge within about 10 epochs. Here is how the converged attention looks (quite similar to the plot in the paper):
This repository is released under the MIT License. Feel free to use any part of this code in your projects. Email stsashank6@gmail.com with any questions.