Speech Emotion Recognition (SER) using CNNs and CRNNs Based on Mel Spectrograms and Mel Frequency Cepstral Coefficients (MFCCs).
This is the source code from this Article, which goes more into detail on Speech Emotion Recognition.
Speech Emotion Recognition (SER) involves analyzing various acoustic features of speech signals to identify emotional states. These features include pitch, volume, tempo, and pauses in speech. Machine learning algorithms are trained on labeled datasets containing audio recordings with corresponding emotion labels (e.g., happy, sad, angry). The models then learn patterns that correlate these acoustic features with specific emotions.
There are two primary approaches to Speech Emotion Recognition:
-
Rule-based systems: These systems rely on predefined rules and heuristics to map acoustic features to emotional states. For example, a high pitch might be associated with happiness while a low pitch could indicate sadness. However, these systems may struggle in cases where emotions are not clearly expressed through typical vocal patterns or when multiple emotions are present simultaneously.
-
Data-driven models: These models use machine learning algorithms to learn relationships between acoustic features and emotional states from labeled datasets. Data-driven approaches tend to be more accurate than rule-based systems but require large amounts of high-quality training data. Deep neural networks, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), are commonly used in this context.
We will be using the second approach in this project due to its flexibility and superiour performance.
Four datasets are used in this project:
Three different models are used to classify the emotions:
-
A Convolutional Neural Network (CNN) trained on the Mel Spectrograms of the audio files. Mel Spectrograms look like this:
-
A CNN trained on the Mel Frequency Cepstral Coefficients (MFCCs) of the audio files. MFCCs look like this:
-
A Convolutional Recurrent Neural Network (CRNN) trained on the Mel Frequency Cepstral Coefficients of the audio files.
The project is organized as follows:
- Data Collection
- Data Wrangling
- Data Exploration, Preparation, and Visualization
- Data Preprocessing 5.1 Mel Spectrogram CNN Model 5.2 MFCCs CNN Model 5.3 MFCCs CRNN Model
Each of these sections is contained in the Jupyter notebook speech_emotion_recognition.ipynb
.
After training and testing the three models, we found that:
- The CNN model trained on Mel Spectrograms performed well, but had some difficulty distinguishing between certain emotions.
- The CNN model trained on MFCCs showed the best performance among all three models, indicating that MFCCs may be a more effective representation for this task.
- The CRNN model trained on MFCCs also performed well, but it did not outperform the CNN model trained on MFCCs and tended to overfit.
We evaluated the models using Precision, Recall, and the F1 score. These metrics provide a more comprehensive view of model performance than accuracy alone, as they take into account both false positives and false negatives. The CNN model trained on MFCCs achieved the highest scores on these metrics, further confirming its superior performance.
These are the scores for each model:
- The CNN model trained on Mel Spectrograms achieved an accuracy of 71%, with a macro average F1-score of 0.73.
- The CNN model trained on MFCCs showed the best performance among all three models, achieving an accuracy of 74%, with a macro average F1-score of 0.76.
- The CRNN model trained on MFCCs also performed well, achieving an accuracy of 74%, with a macro average F1-score of 0.75.
Here are the detailed scores for each model:
Mel Spectrogram Model
precision recall f1-score support
anger 0.75 0.77 0.76 390
disgust 0.67 0.67 0.67 369
fear 0.69 0.65 0.67 378
happiness 0.68 0.63 0.65 396
neutral 0.70 0.80 0.75 326
sadness 0.70 0.72 0.71 396
surprise 0.92 0.88 0.90 139
MFCCs CNN Model
precision recall f1-score support
anger 0.80 0.82 0.81 390
disgust 0.71 0.71 0.71 369
fear 0.70 0.66 0.68 378
happiness 0.73 0.69 0.71 396
neutral 0.75 0.77 0.76 326
sadness 0.70 0.75 0.73 396
surprise 0.93 0.92 0.93 139
MFCCs CRNN Model
precision recall f1-score support
anger 0.81 0.81 0.81 390
disgust 0.71 0.70 0.70 369
fear 0.70 0.66 0.68 378
happiness 0.72 0.68 0.70 396
neutral 0.75 0.75 0.75 326
sadness 0.68 0.75 0.71 396
surprise 0.91 0.91 0.91 139
These are the Accuracy and Loss values for each model per epoch:
To run the code in this repository, you will need to have Python installed, along with the libraries listed in requirements.txt
. You can install these with pip using the command pip install -r requirements.txt
.
Once the dependencies are installed, you can open the notebook speech_emotion_recognition.ipynb
in Jupyter and run the cells to execute the code.
This project is licensed under the terms of the MIT license.