I. Aim
To analyse sentiments based on acoustic features and to classify the sentiments into 10 classes.
II. Classes
- Female angry
- Female calm
- Female fearful
- Female happy
- Female sad
- Male angry
- Male calm
- Male fearful
- Male happy
- Male sad
III. Dataset
The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) contains 7356 files (total size: 24.8 GB). Of the speech and song recordings, we have used only the speech dataset. The database contains 24 professional actors (12 female, 12 male), vocalizing two lexically matched statements in a neutral North American accent. Speech includes calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression. We have used the audio-only (16-bit, 48 kHz) data. The speech archive (Audio_Speech_Actors_01-24.zip, 215 MB) contains 1440 files: 60 trials per actor x 24 actors = 1440. Of the 8 emotions, we have chosen calm, happy, sad, angry, and fearful for classification, i.e. 960 samples.
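The 10 gender-emotion classes can be derived directly from the RAVDESS filename convention, which encodes seven identifier fields (Modality-VocalChannel-Emotion-Intensity-Statement-Repetition-Actor); odd actor numbers are male and even are female. A minimal sketch of the label extraction (the example filename is illustrative):

```python
# Map RAVDESS filename fields to the 10 gender-emotion classes.
# Filename format: Modality-VocalChannel-Emotion-Intensity-Statement-Repetition-Actor.wav
# Emotion codes: 01=neutral, 02=calm, 03=happy, 04=sad,
#                05=angry, 06=fearful, 07=disgust, 08=surprised.

EMOTIONS = {"02": "calm", "03": "happy", "04": "sad", "05": "angry", "06": "fearful"}

def label_from_filename(name):
    parts = name.replace(".wav", "").split("-")
    emotion_code, actor = parts[2], int(parts[6])
    if emotion_code not in EMOTIONS:
        return None  # skip neutral, disgust, surprised
    gender = "male" if actor % 2 == 1 else "female"
    return f"{gender}_{EMOTIONS[emotion_code]}"

# Example: emotion field "05" (angry), actor 12 (even, so female)
print(label_from_filename("03-01-05-01-02-01-12.wav"))  # -> female_angry
```

Files whose emotion code falls outside the five chosen emotions return `None` and are excluded, leaving the 960 samples used for classification.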
IV. Speech file loading parameters
- Sampling rate: 44.1 kHz
- Speech file duration: 2.5 seconds
- Hop length: 512
- Number of frames: (44100 * 2.5) / 512 ≈ 216 frames
We have experimented with different hop lengths and sampling rates; the chosen hop length and sampling rate give good accuracy.
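The frame count above can be verified with a short calculation; with librosa's default centred framing the number of frames is one more than the integer division of the sample count by the hop length, which matches the 216 frames used here:

```python
# Frame-count arithmetic for the loading parameters above.
sr = 44100        # sampling rate (Hz)
duration = 2.5    # seconds
hop_length = 512

n_samples = int(sr * duration)          # 110250 samples
n_frames = 1 + n_samples // hop_length  # centred framing: 1 + floor(110250/512)
print(n_frames)  # -> 216
```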
V. Requirements
python | tensorflow | librosa | matplotlib | keras | sklearn
VI. Feature
MFCC : Mel Frequency Cepstral Coefficients (MFCCs) are a feature widely used in automatic speech and speaker recognition. The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of MFCCs is to accurately represent this envelope.
Steps to find MFCC:
- Frame the signal into short frames and for each frame calculate the periodogram estimate of the power spectrum.
- Apply the mel filterbank to the power spectra, sum the energy in each filter.
- Take the logarithm of all filterbank energies.
- Take the DCT of the log filterbank energies.
- Keep DCT coefficients 2-13, discard the rest.
VII. Exploratory Data Analysis (EDA)
- Waveform plot for speech sample
- Scaled MFCC heatmap (13x216): number of MFCC coefficients (n_mfcc) = 13, number of frames = 216
VIII. CNN Result
We have also tested the MFCC features with MLP and LSTM models, but the CNN gave better performance than both.
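A hypothetical sketch of a small Keras CNN over the 13x216 MFCC input with 10 output classes is shown below; the layer sizes and dropout rate are illustrative assumptions, not the exact architecture used:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative 2-D CNN over 13x216 MFCC "images", 10 gender-emotion classes.
model = keras.Sequential([
    layers.Input(shape=(13, 216, 1)),            # (n_mfcc, frames, channel)
    layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),                         # assumed regularization
    layers.Dense(10, activation="softmax"),      # one unit per class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The same MFCC matrices, reshaped with a trailing channel axis, would feed this model via `model.fit`.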
The flowchart below shows the overall flow of the EDA and the use cases presented to the customer.