Workshops
Spoken language is the natural means by which humans communicate with each other, and a very important tool for social and cultural participation. Acoustic speech signals, produced physically by muscles and effortlessly steered by will, carry information that is transmitted by waves through the air and can be heard, understood, and interpreted by other humans. The inability or difficulty to utter one's own thoughts or to comprehend those of fellow human beings can severely compromise the quality of life. Linguists say that speech information is encoded in sentences, which consist of words, which in turn consist of phonemes, the smallest acoustic units that can represent a difference in meaning. Speech signals are usually very robust, i.e., the same information is encoded several times, which allows some degradation (e.g., when other acoustic signals are added) without a loss of information. Eventually, however, parts of the original message can no longer be recovered. The aim of speech intelligibility prediction is to model and predict in which (degraded) conditions speech signals can no longer be correctly recovered.
Speech intelligibility is an abstract measure of how comprehensible a speech signal is, e.g., understanding "everything", "something", or "nothing". Apart from questionnaires, which are very subjective, speech intelligibility can be measured with speech recognition tests. For example, sentences can be presented in the presence of an interfering signal and the number of correctly recognized words can be counted. Often, the level or signal-to-noise ratio (SNR) at which a certain proportion (e.g., 50%) of the words is correctly recognized is determined; it is called the speech reception threshold (SRT). The standardisation of the speech material and of the method used to determine the SRT allows SRTs to be compared between different interferers and across different listeners. At the same overall presentation level ("volume"), some interferers degrade the speech signal more than others. For example, white noise degrades speech signals much more than a competing talker does. The hearing abilities of the listener also affect speech intelligibility. Listeners with impaired hearing generally show worse results, i.e., higher SRTs, than those with normal hearing.
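The relation between SNR and the proportion of correctly recognized words is often summarized by a logistic function whose 50% point is the SRT. The following Python sketch illustrates this; the function name, the two SRT values, and the slope are assumptions chosen for illustration, not measured data.

```python
import math

def recognition_rate(snr_db, srt_db, slope_per_db):
    """Logistic model of the proportion of correctly recognized words.

    srt_db is the SNR at which 50% of the words are recognized,
    slope_per_db is the slope of the curve at that point (in 1/dB).
    """
    return 1.0 / (1.0 + math.exp(4.0 * slope_per_db * (srt_db - snr_db)))

# Hypothetical listeners: these SRTs are made up for illustration.
listeners = {"normal hearing": -7.0, "impaired hearing": -2.0}
slope = 0.15  # assumed slope: 15 percentage points per dB at the SRT

for name, srt in listeners.items():
    rates = [round(recognition_rate(snr, srt, slope), 2) for snr in (-10, -5, 0)]
    print(name, rates)
```

At every SNR the listener with the higher SRT recognizes fewer words, which is what "worse results, i.e., higher SRTs" means in practice.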
"Models" are things that are or behave similar to another thing in some aspect; a very general and unspecific concept. Here, we consider executable models that (ideally), given the speech material, the acoustic condition, and optionally the hearing status of the listener, generate a prediction of the outcome of the speech recognition test. However, many models require more information to predict an SRT.
The kind of information that is required to perform predictions is critical in this regard. Commonly used inputs to models of speech intelligibility can be classified as follows:
- Information that describes the recognition task, e.g., speech and noise signals, reverberation, optional processing algorithms.
- Information about the hearing status of a listener, e.g., the audiogram or other psychoacoustic measures.
- Information about the human recognition performance in a condition that shares any properties with the condition that is to be predicted.
The first category contains the information that is required to perform the test with human listeners; it does not reveal the outcome of that test. The second category usually describes individual performance in tasks with acoustic stimuli that are not speech, and hence does not reveal much about speech recognition abilities. In fact, many of these tasks can be performed by trained animals. The third category reveals the speech recognition performance of human listeners in a condition that can be similar to the condition for which the outcome is to be predicted. The similarity can often be found in one or more of the following aspects that affect speech intelligibility:
- Language
- Complexity (e.g., syllables, words, sentences)
- Talker (e.g., articulation, speed)
- Recording characteristics (e.g., bandwidth, microphone, encoding)
- Reverberation (e.g., office, living room, bathroom, hallway, church)
- Noise characteristics (e.g., stationary, fluctuating, speech)
- Signal processing (e.g., dynamic compression, noise reduction)
- Individual hearing status (e.g., audiogram)
Many of these interact with each other. For example, the same noise signal can have different effects for different languages. Likewise, the same signal processing can have a different effect for different noise signals. Further, the same talker is not available in many languages. A perfect model could predict the effects of all these parameters on the outcome of the corresponding speech test. When providing empirical SRT data to the model, many of these effects do not need to be predicted anymore because they were already measured.
In "single-ended" applications such as hearing aids, the acoustic input is not separated into speech and noise. Therefore, the first category is not available because only mixed speech and noise signals are available.
A range of models exists that look at technical properties of the speech and noise signals, determine from these the proportion of the speech signal that is accessible to a human listener, and integrate this information into a value between 0 ("unintelligible") and 1 ("perfectly intelligible"). The most prominent representatives of this class of models are the Articulation Index (AI), the Speech Transmission Index (STI), and the Speech Intelligibility Index (SII). Several extensions and refinements exist which address different shortcomings of the original models, e.g., speech intelligibility in fluctuating noise maskers, speech intelligibility of binaural signals, or speech intelligibility of non-linearly processed signals. Because these models do not perform the speech recognition task that human listeners have to perform, they are computationally inexpensive. However, their output is a value between 0 and 1, which still must be converted to an SRT to predict the outcome of a speech test. This can only be achieved by calibrating the index values with empirical reference data, e.g., an index value of 0.22 indicates 50% correct for a specific speech test material. The use of empirical data measured in similar, possibly very similar, conditions to predict empirical data is problematic. No predictions can be made for a single isolated condition. Only differences from a reference condition, for which empirical data is required, can be predicted. The predictions depend on the chosen reference conditions and the corresponding empirical SRTs. For example, which reference conditions should be chosen if the task is to predict the outcome in another language?
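The dependence on a reference can be made concrete with a toy example. The Python sketch below uses a made-up single-band index that rises linearly from 0 at -15 dB SNR to 1 at +15 dB SNR; real indices such as the SII are considerably more elaborate. The function names and the second reference value (0.30) are assumptions for illustration only.

```python
def toy_index(snr_db):
    """Made-up index: 0 at -15 dB SNR, 1 at +15 dB SNR, clipped to [0, 1]."""
    return min(1.0, max(0.0, (snr_db + 15.0) / 30.0))

def srt_from_index(reference_index, lo=-30.0, hi=30.0, tol=1e-6):
    """Find the SNR at which toy_index reaches reference_index (bisection)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if toy_index(mid) < reference_index:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# The reference index has to come from empirical data for the speech material:
# a different reference value yields a different predicted SRT.
for reference_index in (0.22, 0.30):
    print(reference_index, round(srt_from_index(reference_index), 1))
```

The index itself never yields an SRT; only the empirically determined reference value turns it into one.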
One approach to overcome this limitation is the direct simulation of the speech recognition test by means of an automatic speech recognizer (ASR), where the outcome of the simulated test is used as the prediction. Usually, automatic speech recognizers do not perform as well as human listeners when recognizing degraded speech signals. The reasons for this gap in performance are diverse and not fully understood. One aspect is that ASR systems are generally "trained" to recognize speech in conditions different from those in which they are employed. This is simply because the talker, the microphone, and the background noise cannot be known in advance. While human listeners are incredibly good at adapting to unknown talkers and noise conditions, ASR systems still struggle with this. Fortunately, the goal in modelling speech intelligibility is not to build a model that works in unknown conditions, but to predict the performance in known conditions. In contrast to empirical reference SRTs, knowledge of the exact speech and noise samples does not reveal any information about the empirical SRTs. This is why, for predictions, it is advantageous to train the ASR system in the same condition in which it is tested, which is also called "matched" training. Just as human participants of a matrix test perform a training routine to get used to the test setup, the ASR system is trained to become an equivalently knowledgeable listener. In fact, the performance of ASR systems with matched training is very close to the performance of human listeners in the matrix sentence test in many different conditions. The simulation framework for auditory discrimination experiments (FADE) implements all necessary steps and was shown to accurately predict empirical SRTs in many different conditions, including different noise maskers, speakers, languages, hearing impairment, and even hearing aid algorithms.
For all these predictions, not a single empirical SRT had to be measured in advance, which is why this method is considered empirical reference-free.
We will generate the speech material for a toy test.
Invent a small new matrix speech test. Ideas:
- Digit triplets: One two one
- Simple sentences: Peter has spoons
- NATO Alphabet: Foxtrot Alfa Delta Echo
- ... whatever you like
Limit the number to 20 words at most (e.g., a 3x6 or 4x5 matrix). Construct 30 sentences so that each word occurs equally often. Record the sentences with these words and name them with digits indicating the option. For example, for the digit triplet test:
"one - three - five": 135.wav
Or for the NATO alphabet:
"Foxtrot - Alfa - Delta - Echo": 6145.wav
Use the program "sox" to normalize your recordings to -65 dB full-scale (FS) sound level (make a backup first):
sox --norm=-65 in.wav out.wav
If you don't want to type the same command 30 times, run the following command in the folder where your recordings are stored:
mkdir normalized; ls -1 *.wav | xargs -n1 -i{} sox --norm=-65 {} -r16000 -b32 normalized/{}
This "magic" command will create a folder "normalized", generate a list of all wav files, and then execute the sox command line by line performing normalization and downsampling in one go.
Consider recording the same sentences with two different speakers; who will be the more intelligible one?
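If you prefer not to build the list of 30 sentences by hand, a balanced list can be drawn automatically. The following Python sketch assumes a matrix with the same number of options at every position and at most 9 options, so that each option maps to one digit in the file name; the function names are hypothetical.

```python
import random

def balanced_sentences(n_positions, n_options, n_sentences, seed=0):
    """Draw distinct sentences in which every option occurs equally often.

    Each sentence is a tuple of 1-based option numbers, one per position.
    """
    if n_sentences % n_options != 0:
        raise ValueError("n_sentences must be a multiple of n_options")
    if n_options ** n_positions < n_sentences:
        raise ValueError("not enough distinct sentences in this matrix")
    rng = random.Random(seed)
    repeats = n_sentences // n_options
    while True:
        columns = []
        for _ in range(n_positions):
            # Every option appears exactly `repeats` times at this position.
            column = list(range(1, n_options + 1)) * repeats
            rng.shuffle(column)
            columns.append(column)
        sentences = list(zip(*columns))
        if len(set(sentences)) == n_sentences:  # redraw if a sentence repeats
            return sentences

def file_name(sentence):
    """Name a recording by its option digits, e.g. (1, 3, 5) -> '135.wav'."""
    return "".join(str(option) for option in sentence) + ".wav"

# 3x6 matrix: 3 positions with 6 options each, 30 sentences.
for sentence in balanced_sentences(3, 6, 30)[:5]:
    print(file_name(sentence))
```

With a 3x6 matrix and 30 sentences, every word occurs exactly five times; with a 4x5 matrix, six times.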
FADE runs within an Ubuntu Linux environment and requires the installation of the GNU Octave interpreter (a free alternative to Matlab) and the Hidden Markov Model Toolkit (HTK). To save you the time of setting up the software environment, you will use a USB drive with a bootable Ubuntu Linux 17.10 that includes all the necessary packages.
Open a terminal (Ctrl+Alt+T) and type "fade".
A description of the command syntax will appear.
The structure is:
fade <PROJECT> <ACTION> [argument1] [argument2] ...
The square brackets indicate optional arguments for which sensible default values are defined.
First, you need to create an empty project for the prediction of the outcome of a matrix-style sentence test. Please use the following command, where you replace PROJECT with the desired project name:
fade PROJECT corpus-matrix
A new directory named after your project and subdirectories with fundamental settings for the experiment will be generated. Feel free to browse through the folders and look into the files.
In the next step you need to configure the project to define the experiment you want to perform.
The most important step is the deployment of the speech and noise signals. Locate the subdirectories "PROJECT/source/speech" and "PROJECT/source/noise". Copy the speech files to "PROJECT/source/speech" and the noise files to "PROJECT/source/noise".
Next, configure the project to use Mel-frequency cepstral coefficient (MFCC) features. To do so, first type 'fade-config' in the command line to see the description of the syntax:
fade-config <project> <action> [argument1] [argument2] [argument3] ...
The syntax is the same as for the 'fade' command, with the difference that 'fade-config' only changes the settings without performing the corresponding action, here the feature extraction.
Now, to configure the project to use Mel-frequency cepstral coefficient (MFCC) features, use:
fade-config PROJECT features mfcc
The feature extraction scripts will be copied to your project folder. Again, feel free to browse through the folders, locate the feature extraction scripts, and look into the files. If you know how to program in Octave/Matlab, you can modify the feature extraction. This can be useful to implement the speech perception of listeners who are hard of hearing.
Now, several steps are required to predict an SRT. To facilitate the use of FADE, you can use a script which executes them in the correct order for you.
Type 'complete_project.sh' in the command line. The syntax is:
complete_project.sh PROJECT [START]
It will run the necessary steps to perform the simulation of a speech recognition test:
- First, the speech and noise files are mixed at several signal-to-noise ratios (SNRs) for training and testing.
- Then, features (here MFCC features) are extracted from the noisy recordings.
- Based on the training features, word models are trained with Gaussian Mixture and Hidden Markov Models.
- Subsequently, the trained models are used to recognize the speech in the testing features.
- After recognition, the recognition performance is evaluated for each training/testing SNR combination.
- The lowest SNR at which 50% of the words were correctly recognized is determined as the simulation outcome.
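The last two steps can be sketched as follows. This Python example only illustrates the idea of finding the 50% point for each training SNR and taking the lowest one; FADE's own evaluation scripts may interpolate differently. The function names and all recognition rates are made up for illustration.

```python
def interpolate_srt(test_snrs, rates, target=0.5):
    """Lowest SNR at which the rate reaches target, by linear interpolation.

    test_snrs must be ascending. Returns None if the target is never reached.
    If the target is already reached at the lowest test SNR, that SNR is
    returned (the true threshold may then be lower).
    """
    if rates[0] >= target:
        return test_snrs[0]
    for i in range(1, len(test_snrs)):
        if rates[i - 1] < target <= rates[i]:
            fraction = (target - rates[i - 1]) / (rates[i] - rates[i - 1])
            return test_snrs[i - 1] + fraction * (test_snrs[i] - test_snrs[i - 1])
    return None

def simulated_srt(results, test_snrs):
    """Return (srt, training_snr) with the lowest SRT over all training SNRs."""
    candidates = []
    for train_snr, rates in results.items():
        srt = interpolate_srt(test_snrs, rates)
        if srt is not None:
            candidates.append((srt, train_snr))
    return min(candidates) if candidates else None

# Made-up proportions of correctly recognized words, not FADE output.
test_snrs = [-12, -9, -6, -3, 0]
results = {
    -6: [0.20, 0.40, 0.70, 0.90, 0.95],  # models trained at -6 dB SNR
    0: [0.15, 0.25, 0.45, 0.75, 0.95],   # models trained at 0 dB SNR
}
print(simulated_srt(results, test_snrs))
```

In this invented example the models trained at -6 dB SNR reach 50% at a lower test SNR than those trained at 0 dB, so their SRT would be reported as the outcome.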
Each step will leave its results in a subdirectory in your project folder. Please take a few minutes to explore the contents of each folder. You will find some figures and the outcome of your experiment in the 'figures' subdirectory.
These will help you to generate presentable figures.
The speech and noise signals are mixed at different SNRs and stored in the 'corpus' folder.
Listen to some of them and find three examples of the same sentence with different intelligibility.
Copy them to a separate folder for this exercise.
Also copy the same sentence from the "source" folder to the separate folder.
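To get a feeling for what mixing at a given SNR means, the following Python sketch scales a speech signal relative to a noise signal based on their RMS levels. It is an illustration under stated assumptions, not FADE's own mixing code: the function names are hypothetical, scaling the speech (rather than the noise) is an arbitrary choice, and the signals are synthetic stand-ins for real recordings.

```python
import math
import random

def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))

def speech_gain_for_snr(speech, noise, snr_db):
    """Gain to apply to the speech so that it lies snr_db above the noise."""
    return rms(noise) / rms(speech) * 10.0 ** (snr_db / 20.0)

def mix_at_snr(speech, noise, snr_db):
    """Scale the speech to the requested SNR and add the noise."""
    gain = speech_gain_for_snr(speech, noise, snr_db)
    return [gain * s + n for s, n in zip(speech, noise)]

# Synthetic stand-ins: a 440 Hz tone as "speech" and Gaussian noise,
# one second each at a 16 kHz sampling rate.
rng = random.Random(1)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 0.1) for _ in range(16000)]

for snr_db in (-10, 0, 10):
    gain = speech_gain_for_snr(speech, noise, snr_db)
    achieved = 20 * math.log10(gain * rms(speech) / rms(noise))
    print(snr_db, round(achieved, 6))
```

A change of 6 dB in SNR corresponds to roughly a factor of two in the amplitude of the speech relative to the noise.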
Calculate the log Mel-spectrogram, which is the basis from which the MFCC features are extracted. You can find the GNU/Octave function "log_mel_spectrogram.m" to calculate it in the folder "~/fade/fade.d/features.d/standalone". Copy it to the same folder as the audio files.
Write a script which reads the audio files, calculates the log Mel-spectrograms, and plots them one above the other.
- Try to identify typical patterns of speech (e.g., vowels, fricatives, ...)
- What happens in the less intelligible recordings? Compare their SNRs.
Locate the feature files (.htk) in the 'features' folder which correspond to the files you analysed before and copy them to a separate folder. You can find a GNU/Octave script "readhtk.m" to read these files in the folder "~/fade/scripts/run-matlab.d/features/" (copy it). Load a feature-file (.htk) with a high SNR with Octave and the function 'readhtk.m'.
Write a script which reads the feature files and plots their content one above the other.
- Compare the features to the log Mel-spectrograms
- Calculate the inverse discrete cosine transformation with "idct(features)"
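As background for the comparison: MFCCs are obtained by applying a discrete cosine transform (DCT) to each frame of the log Mel-spectrogram and keeping only the lower coefficients, so the inverse transform of the kept coefficients yields a smoothed spectrum. The Python sketch below shows this for a single made-up frame with 8 bands. The orthonormal scaling is an assumption, and the features stored by FADE may be scaled differently or contain additional components.

```python
import math

def dct(values):
    """Orthonormal DCT-II of a list of values."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, v in enumerate(values))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out

def idct(coefficients):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(coefficients)
    out = []
    for i in range(n):
        total = coefficients[0] * math.sqrt(1.0 / n)
        total += sum(c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                     for k, c in enumerate(coefficients) if k > 0)
        out.append(total)
    return out

# Made-up log Mel frame with 8 bands (real frames have more bands).
frame = [1.0, 2.0, 4.0, 3.0, 2.5, 2.0, 1.0, 0.5]
coefficients = dct(frame)
truncated = coefficients[:4] + [0.0] * 4  # keep only the lower coefficients
smoothed = idct(truncated)
print([round(v, 2) for v in smoothed])
```

Keeping all coefficients reproduces the frame exactly; dropping the higher ones removes fine spectral detail while preserving the coarse spectral shape.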
Look at the figure in the 'figures' folder.
- Which is the lowest SRT that could be achieved, and at which training SNR?
- What is a psychometric function? How is it measured in hearing tests?
- Compare the SRTs with the stationary and the fluctuating interferer
- Compare the SRTs with the different speech material (only, if you have two tests)
- Could you design recordings which would simulate a tone-in-noise detection experiment?
FADE is free software: https://github.com/m-r-s/fade
The README file gives more information about each processing step: https://github.com/m-r-s/fade/blob/master/README.md
There are tutorials in the "~/fade/tutorials" folder or on-line: https://github.com/m-r-s/fade/tree/master/tutorials