Speaker identification/recognition using:
- The Free ST American English Corpus dataset (SLR45)
- Mel-frequency cepstrum coefficients (MFCC)
- Gaussian mixture models (GMM)
The Free ST American English Corpus dataset (SLR45) can be found on OpenSLR under SLR45. It is a free American English corpus by Surfingtech, containing utterances from 10 speakers (5 females and 5 males). Each speaker has about 350 utterances.
Mel-frequency cepstrum coefficients (MFCCs) are used here, since they are among the most widely used and effective features for speaker verification. MFCCs are commonly derived as follows:
- Take the Fourier transform of (a windowed excerpt of) a signal.
- Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows.
- Take the logs of the powers at each of the mel frequencies.
- Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
- The MFCCs are the amplitudes of the resulting spectrum.
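The steps above can be sketched with NumPy alone. This is a minimal illustration and not necessarily what FeaturesExtractor.py does: the function names, the frame length (400 samples), hop (160 samples), FFT size (512), number of mel filters (26) and number of coefficients (13) are all assumptions chosen for a 16 kHz signal, and the DCT is left unnormalised.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular overlapping filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    """Unnormalised DCT-II along the last axis, keeping the first n_out terms."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1. Window the signal and take the Fourier transform of each frame.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2 / n_fft
    # 2. Map the power spectrum onto the mel scale.
    mel_energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    # 3. Take the log of each mel-band energy (small offset avoids log(0)).
    log_mel = np.log(mel_energies + 1e-10)
    # 4-5. DCT of the log energies; the MFCCs are the resulting amplitudes.
    return dct_ii(log_mel, n_ceps)

# Usage on a synthetic one-second 440 Hz tone (a stand-in for a .wav file).
t = np.arange(16000) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
features = mfcc(tone)
print(features.shape)  # one row per frame, one column per coefficient
```

In practice a dedicated feature-extraction library would usually replace this hand-written version; the sketch only makes the five steps concrete.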
According to D. Reynolds in "Gaussian Mixture Models": A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model.
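Written out, the "weighted sum of Gaussian component densities" in the definition above takes the following standard form. The symbols are notation introduced here for illustration: $\mathbf{x}$ is a $D$-dimensional feature vector (for example one MFCC frame), $M$ is the number of components, and $\lambda = \{w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\}$ collects the mixture weights, means and covariances.

$$
p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i), \qquad \sum_{i=1}^{M} w_i = 1
$$

$$
g(\mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = \frac{1}{(2\pi)^{D/2} \, |\boldsymbol{\Sigma}_i|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^{\top} \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right\}
$$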
This project relies on the following scripts/modules:
- Run.py: This is the main script; it runs the whole cycle (data management > model training > speaker identification).
- DataManager.py: This script is responsible for extracting and structuring the data.
- ModelsTrainer.py: This script is responsible for training a Gaussian Mixture Model (GMM) for each speaker.
- SpeakerIdentifier.py: This script is responsible for testing the system by identifying who is speaking in the test files.
- FeaturesExtractor.py: This script is responsible for extracting the MFCC features from the .wav files.
- SilenceEliminator.py: This script removes silence from the .wav files to clean and shorten the input before feature extraction.
- The code can be further optimized using multi-threading, multi-processing and acceleration libraries.
- The accuracy can be further improved using GMM normalization, also known as a UBM-GMM system (a GMM combined with a universal background model).
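The identification step can be illustrated as follows: each speaker's GMM scores the test features, and the speaker whose model gives the highest average log-likelihood is chosen. This is a hedged sketch, not the code in SpeakerIdentifier.py. It assumes diagonal covariances, and the function names, the `models` dictionary layout and the toy parameters are all hypothetical; in the real system the parameters would come from EM training on each speaker's MFCCs.

```python
import numpy as np

def diag_gmm_loglik(features, weights, means, variances):
    """Average per-frame log-likelihood under a diagonal-covariance GMM.

    features: (T, D) array; weights: (M,); means and variances: (M, D).
    """
    diff = features[:, None, :] - means[None, :, :]                      # (T, M, D)
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)    # (M,)
    log_comp = log_norm - 0.5 * np.sum(diff ** 2 / variances, axis=2)    # (T, M)
    log_weighted = log_comp + np.log(weights)
    # Log-sum-exp over components for numerical stability.
    top = log_weighted.max(axis=1, keepdims=True)
    frame_ll = top[:, 0] + np.log(np.sum(np.exp(log_weighted - top), axis=1))
    return frame_ll.mean()

def identify(features, models):
    """Return the best-scoring speaker name and all per-speaker scores."""
    scores = {name: diag_gmm_loglik(features, *params)
              for name, params in models.items()}
    return max(scores, key=scores.get), scores

# Toy models (assumed values): 2 components, 2-dimensional features.
models = {
    "speaker_a": (np.array([0.5, 0.5]),
                  np.array([[0.0, 0.0], [1.0, 1.0]]),
                  np.ones((2, 2))),
    "speaker_b": (np.array([0.5, 0.5]),
                  np.array([[5.0, 5.0], [6.0, 6.0]]),
                  np.ones((2, 2))),
}

# Synthetic test frames drawn close to speaker_b's component means.
rng = np.random.default_rng(0)
test_features = rng.normal(loc=5.5, scale=0.5, size=(200, 2))

winner, scores = identify(test_features, models)
print(winner, scores)
```

Averaging the log-likelihood over frames keeps scores comparable between utterances of different lengths; a UBM-GMM system would additionally subtract the score of a background model from each speaker's score.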