This is a simple machine learning model that takes a video of a person speaking and predicts what was said.
- Used TensorFlow to build the model
- Used Keras for data processing and NumPy for array handling
- Used a Sequential model for training and prediction
- Used ReLU as the activation function
- Used 3 Conv3D layers and 2 bidirectional LSTM layers for training
- Used the Adam optimizer and CTC loss for training
- Used imageio to read the video and cv2 to get the frames
- Made two functions, load_video and load_alignments, to load the video and the alignments
- The video function converts the video frames to grayscale and crops them to the mouth region to reduce training time
- The alignments function reads the words from the alignment files, stores them as tokens, and then converts them to numbers
- We then use a mappable function to pass every input through these two loading functions
- We build a Sequential model with 3 Conv3D layers and 2 bidirectional LSTM layers using TensorFlow Keras layers
- We use ReLU as the activation function and Adam as the optimizer
- We use CTC loss to train the model
- We train the model with the fit function for a fixed number of epochs (more than 90 epochs for better accuracy)
- We use the trained model to predict the speech in the video
- Download the dataset from the link given below
- Extract the dataset and place it in the same folder as the code
- Run the code in Jupyter Notebook or any other IDE
- The code will train the model and predict the speech of the video
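
The data loading steps described above can be sketched with NumPy alone. This is a minimal illustration, not the project's actual load_video and load_alignments: the crop window, the 288 x 360 frame size, the vocabulary, and the names preprocess_frames and encode_words are all assumptions chosen so that the output matches the 46 x 140 single-channel frames the model expects.

```python
import numpy as np

# Hypothetical crop window: rows 190:236 and columns 80:220 give 46 x 140 frames.
CROP_ROWS = slice(190, 236)
CROP_COLS = slice(80, 220)

# Assumed vocabulary: lowercase letters, a few punctuation marks, digits 1-9, space.
VOCAB = list("abcdefghijklmnopqrstuvwxyz'?!123456789 ")
# Ids start at 1; id 0 is kept for characters that are not in the vocabulary.
CHAR_TO_NUM = {ch: i + 1 for i, ch in enumerate(VOCAB)}


def preprocess_frames(frames_rgb):
    """Convert (T, H, W, 3) RGB frames to cropped grayscale frames (T, 46, 140, 1)."""
    frames = np.asarray(frames_rgb, dtype=np.float32)
    # Weighted sum of the R, G, B channels (standard luminance weights).
    gray = frames @ np.array([0.299, 0.587, 0.114], dtype=np.float32)
    # Keep only the assumed mouth region of every frame.
    cropped = gray[:, CROP_ROWS, CROP_COLS]
    # Add back a trailing channel axis so the shape fits a Conv3D input.
    return cropped[..., np.newaxis]


def encode_words(words):
    """Join words with spaces and map each character to its integer id."""
    text = " ".join(words)
    return np.array([CHAR_TO_NUM.get(ch, 0) for ch in text], dtype=np.int64)


# Five random frames stand in for a decoded video clip.
video = np.random.randint(0, 256, size=(5, 288, 360, 3), dtype=np.uint8)
print(preprocess_frames(video).shape)
print(encode_words(["bin", "blue"]))
```

Cropping before training shrinks every frame from the full image to a small mouth region, which is where the saving in training time comes from.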
Model layers used in the code:
model = Sequential()
# Conv3D for video input: input_shape is (75 frames, 46 height, 140 width, 1 channel), with 128 output filters and a 3x3x3 kernel
model.add(Conv3D(128,3,input_shape=(75,46,140,1),padding='same'))
# ReLU activation adds non-linearity
model.add(Activation('relu'))
# max pooling over 2x2 spatial windows in each frame, halving height and width while keeping all 75 frames
model.add(MaxPool3D((1,2,2)))
# 2nd layer with 256 output filters
model.add(Conv3D(256,3,padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))
model.add(Conv3D(75,3,padding='same'))
model.add(Activation('relu'))
model.add(MaxPool3D((1,2,2)))
# flatten each time step's feature maps into one vector (5*17*75 = 6375 values) to feed into the LSTM layers
model.add(TimeDistributed(Flatten()))
# 2 bidirectional LSTM layers
# return_sequences=True means the layer returns the output of every time step
# dropout helps prevent overfitting
# kernel_initializer='Orthogonal' starts the input weights as an orthogonal matrix, which helps keep gradients stable
# 128 is the number of hidden units in each direction
model.add(Bidirectional(LSTM(128, kernel_initializer='Orthogonal',return_sequences=True)))
model.add(Dropout(.5))
model.add(Bidirectional(LSTM(128, kernel_initializer='Orthogonal',return_sequences=True)))
model.add(Dropout(.5))
# dense layer with softmax activation, applied at every time step
# output is a probability distribution over the characters in the vocabulary + 1 for the CTC blank character
# softmax gives the probability of each character; argmax then picks the most likely one at each time step
model.add(Dense(char_to_num.vocabulary_size()+1, kernel_initializer='he_normal',activation='softmax'))
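
The argmax step mentioned in the comments above is only the first half of turning the softmax output into text: with CTC, repeated predictions are collapsed and blank predictions are dropped. Below is a minimal NumPy sketch of greedy CTC decoding for a single sequence. The function name ctc_greedy_decode is hypothetical, and treating the last class as the blank is an assumption; the project may decode differently, for example with a TensorFlow helper.

```python
import numpy as np


def ctc_greedy_decode(probs, blank_index):
    """Greedy CTC decoding for one sequence of shape (time_steps, num_classes)."""
    # Most likely class at each time step.
    best = np.argmax(probs, axis=-1)
    decoded = []
    previous = None
    for idx in best:
        # Skip consecutive repeats of the same class and skip the blank class.
        if idx != previous and idx != blank_index:
            decoded.append(int(idx))
        previous = idx
    return decoded


# Toy example: 3 classes (0 = "a", 1 = "b", 2 = blank) over 6 time steps.
probs = np.array([
    [0.90, 0.05, 0.05],  # a
    [0.80, 0.10, 0.10],  # a again (repeat, collapsed)
    [0.10, 0.10, 0.80],  # blank
    [0.70, 0.20, 0.10],  # a (kept, because a blank separates it from the first a)
    [0.10, 0.80, 0.10],  # b
    [0.10, 0.10, 0.80],  # blank
])
print(ctc_greedy_decode(probs, blank_index=2))  # [0, 0, 1]
```

The decoded ids would then be mapped back to characters with the inverse of the character-to-number lookup.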
import gdown
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv3D, MaxPool3D, TimeDistributed, LSTM, Bidirectional
import numpy as np
import imageio
import matplotlib.pyplot as plt
import cv2
pip install tensorflow
pip install keras
pip install numpy
pip install imageio
pip install matplotlib
pip install opencv-python
pip install gdown