kaldi-asr-aws

This repository accompanies the Medium article on setting up Kaldi on AWS.

Kaldi => Dockerfile
bash => Contains all the shell scripts
flask-app => Python code and HTML file
lambda => Python code for two lambda functions

Directory Structure Tree

D -> Represents Directory
F -> Represents File
cd /home/ec2-user/
  -audios D [ Files from S3 will be synced here ]
  -audiosKaldi D 
    -processing D [ Files ready for Transcription are moved here ]
    -sendWavToProcess.sh F
  -kaldi D
    -Dockerfile F
  -models D
    -model.zip F (unzip here)
    -transcriptMaster.sh F
    -transcriptWorker.sh F
  -output D [ Transcriptions in .txt files will be stored here ]
  -ffmpeg F
  -getConvertAudios.sh F
  -uploadOutput.sh F

- Commands for this Directory Tree/Other Installations

mkdir audios kaldi models output
mkdir -p audiosKaldi/processing
wget -P /home/ec2-user/ https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
tar -xvf ffmpeg-release-amd64-static.tar.xz
# The extracted directory name contains the release version, so match it with a glob
mv ffmpeg-*-amd64-static/ffmpeg ~/
sudo chmod 755 ~/ffmpeg
wget -P /home/ec2-user/models/ https://crossregionreplpuri.s3.ap-south-1.amazonaws.com/model.zip
unzip /home/ec2-user/models/model.zip -d /home/ec2-user/models/

sudo yum install -y git
sudo yum install -y docker
sudo service docker start
alias docker='sudo docker'
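
Once the commands above have run, the result can be sanity-checked against the directory tree. The following is a minimal sketch, not part of the repo: `check_layout` is a hypothetical helper name, and the base directory defaults to /home/ec2-user as in the tree above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): report which entries of the
# directory tree are missing under a base directory.
check_layout() {
  local base=${1:-/home/ec2-user}
  local missing=0
  local d
  # Directories created by the mkdir commands.
  for d in audios audiosKaldi/processing kaldi models output; do
    if [ ! -d "$base/$d" ]; then
      echo "missing directory: $base/$d"
      missing=1
    fi
  done
  # The static ffmpeg binary must exist and be executable.
  if [ ! -x "$base/ffmpeg" ]; then
    echo "missing or non-executable: $base/ffmpeg"
    missing=1
  fi
  # Returns 0 when everything is in place, 1 otherwise.
  return "$missing"
}
```

For example, `check_layout /home/ec2-user && echo "layout OK"` prints nothing but "layout OK" when every entry is present.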

- Kaldi

Build the Kaldi container using Dockerfile

cd /path/to/Dockerfile
docker build -t kaldi .

Starting the container

docker run -d -it --name kaldi -v ~/models:/models/ -v ~/audiosKaldi/processing:/audios/ -v ~/output:/output/ kaldi bash
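
Because of the three `-v` mounts, the same file has one path on the host and another inside the container. The sketch below illustrates that mapping. `to_container_path` is a hypothetical helper, not part of the repo, and it assumes the host directories sit under `$HOME` exactly as in the `docker run` command above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): translate a host path into the
# path the kaldi container sees, following the -v mounts of the docker run command.
to_container_path() {
  local host_path=$1 rel
  case "$host_path" in
    "$HOME"/models/*)
      rel=${host_path#"$HOME"/models/}
      echo "/models/$rel" ;;
    "$HOME"/audiosKaldi/processing/*)
      rel=${host_path#"$HOME"/audiosKaldi/processing/}
      echo "/audios/$rel" ;;
    "$HOME"/output/*)
      rel=${host_path#"$HOME"/output/}
      echo "/output/$rel" ;;
    *)
      # Anything else (for example ~/audios) is not visible in the container.
      echo "not mounted in the container: $host_path" >&2
      return 1 ;;
  esac
}
```

Note that only audiosKaldi/processing is mounted, so a file becomes visible to the container only after it has been moved into that directory.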

Entering into the container

docker exec -it kaldi bash

- Bash Scripts

getConvertAudios.sh -> Syncs files from S3 into the audios/ directory, converts them with ffmpeg, and stores the results in audiosKaldi/
uploadOutput.sh -> Syncs the .txt files in the output/ directory to the S3 bucket
sendWavToProcess.sh -> Limits the number of files sent for processing to the number of cores on the VM, so that they can be processed in parallel
transcriptMaster.sh -> Calls transcriptWorker.sh for every audio file placed in the processing folder and ensures that, at any time, no more files are being processed than there are cores
transcriptWorker.sh -> Performs the actual transcription
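
The core-limited scheduling described for transcriptMaster.sh can be sketched as follows. This is not the repo's actual script: `run_limited` is a hypothetical helper, the `.wav` extension is an assumption, and the sketch relies on `find` and `xargs -P` (GNU versions, as shipped on Amazon Linux) to cap the number of concurrent workers at the core count.

```shell
#!/usr/bin/env bash
# Hypothetical sketch (not the repo's transcriptMaster.sh): run a worker
# command once per .wav file, with at most one running worker per CPU core.
run_limited() {
  local dir=$1
  shift
  local max_jobs
  # Fall back to a single job if nproc is unavailable.
  max_jobs=$(nproc 2>/dev/null || echo 1)
  # -print0 / -0 keep file names with spaces intact.
  # -n 1 passes one file per worker invocation, appended as the last argument.
  # -P caps the number of workers running at the same time.
  # -r skips the worker entirely when no files are found.
  find "$dir" -maxdepth 1 -type f -name '*.wav' -print0 |
    xargs -0 -r -n 1 -P "$max_jobs" "$@"
}
```

Inside the container this might be invoked as `run_limited /audios /models/transcriptWorker.sh`, assuming the worker takes the audio file path as its only argument.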

- Flask App

Installation/Setup

sudo yum install python2-pip.noarch
sudo pip install virtualenv
virtualenv /path/to/some/directory
source /path/to/some/directory/bin/activate
pip install flask boto3 requests
# Copy the flask-app files into /path/to/some/directory
cd /path/to/some/directory
python app.py &
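
A process started with a plain `&` can receive SIGHUP if the SSH session drops. One possible alternative, sketched here under assumptions, is to start the app with `nohup` and record its PID so it can be stopped later. `start_background` and the log and PID file names are hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): start a command in the
# background, immune to hangups, with its output logged and its PID saved.
start_background() {
  local pidfile=$1 logfile=$2
  shift 2
  # Redirect both stdout and stderr so nohup does not create nohup.out.
  nohup "$@" >"$logfile" 2>&1 &
  # $! is the PID of the most recently started background process.
  echo "$!" >"$pidfile"
}
```

For example, `start_background app.pid app.log python app.py` starts the Flask app, and `kill "$(cat app.pid)"` stops it again.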

- Lambda

Refer to the Medium article.

- ffmpeg

Download ffmpeg-release-amd64-static.tar.xz (an md5 checksum is published alongside it) from johnvansickle.com:

wget -P /home/ec2-user/ https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz
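
Since an md5 checksum is available for the archive, the download can be verified before it is extracted. The following is a minimal sketch. `verify_md5` is a hypothetical helper, and the `.md5` URL shown in the comment is an assumption derived from the release file name, so check it against the download page.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repo): compare a file's md5 hash with
# the hash stored in the first field of a checksum file.
verify_md5() {
  local file=$1 md5file=$2
  local expected actual
  expected=$(awk 'NR==1 {print $1}' "$md5file")
  actual=$(md5sum "$file" | awk '{print $1}')
  if [ "$expected" = "$actual" ]; then
    echo "checksum OK: $file"
  else
    echo "checksum MISMATCH: $file" >&2
    return 1
  fi
}

# Example usage (assumed checksum URL, left commented out):
# wget -P /home/ec2-user/ https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-amd64-static.tar.xz.md5
# verify_md5 /home/ec2-user/ffmpeg-release-amd64-static.tar.xz \
#            /home/ec2-user/ffmpeg-release-amd64-static.tar.xz.md5
```

Only the first whitespace-separated field of the checksum file is read, so the helper works whether or not the file also lists the archive name.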

- Kaldi Model

Download model.zip from here - https://crossregionreplpuri.s3.ap-south-1.amazonaws.com/model.zip

wget -P /home/ec2-user/models/ https://crossregionreplpuri.s3.ap-south-1.amazonaws.com/model.zip
