GLUE is a lightweight, Python-based collection of scripts that supports you with speech and text use cases based on Microsoft Azure Cognitive Services. It not only lets you batch-process data, but also glues together the services of your choice in one place and gives you an end-to-end view of the training and testing process.
GLUE consists of multiple modules, which can either be executed separately or run as a central pipeline:
- Batch-transcribe audio files to text transcripts using Microsoft Speech to Text Service (STT).
- Batch-synthesize text data using Microsoft Text to Speech Service (TTS).
- Batch-evaluate reference transcriptions and recognitions.
- Batch-score text strings on an existing, pre-trained Microsoft LUIS-model.
WIP:
- Batch-translate text data using Microsoft Translator.
The toolkit is built on our experience from the field and is a useful add-on for, but not limited to, the following use cases:
- Automated generation of synthetic speech-model training data.
- Batch-transcription of audio files and evaluation given an existing reference transcript.
- Scoring of STT-transcriptions on an existing LUIS-model.
Based on our experience in the field, we see that often a larger corpus of training data is needed for STT engines and LUIS models. The correct transcription and recognition of entities such as cities and names often play a significant role. Sometimes there is a lack of training data here, which is why we have created a small tool for duplicating utterances based on different entity types. If this also applies to your use-case, then we recommend taking a look at this Jupyter Notebook, which will guide you through the necessary steps.
This section describes how you get started with GLUE and which requirements need to be fulfilled by your working environment. If you would like to add your own features and maintain the code in a private repository, you may use the Template function (hit the green button Use this template on the top left).
Before getting your hands on the toolkit, make sure your local computer is equipped with the following frameworks and base packages:
- Python (required, Version >=3.8 is recommended).
- VSCode (recommended), but you can also run the scripts using PowerShell, Bash etc.
- Stable connection for installing your environment and scoring the files.
- ffmpeg for audio file conversion (only for TTS use cases).
- Open a command line of your choice (PowerShell, Bash).
- Change the directory to your preferred workspace (using cd).
- Clone the repository (alternatively, download the repository as a zip archive and unpack it locally to the respective folder).
git clone https://github.com/microsoft/glue
- Enter the root folder of the cloned repository.
cd glue
- Set up the virtual environment.
python -m venv .venv
- Activate the virtual environment.
# Windows:
.venv\Scripts\activate
# Linux:
source .venv/bin/activate
- Install the requirements.
pip install -r requirements.txt
- (optional) If you want to use Jupyter Notebooks, you can register your activated environment using the command below.
python -m ipykernel install --user --name glue --display-name "Python (glue)"
After successfully installing the requirements, your environment is set up and you can go ahead with the next step.
In the root directory of the repository, you can find a file named config.sample.ini. This is the file where the API keys and some other essential configuration parameters have to be set, depending on which services you would like to use. First, create a copy of config.sample.ini and rename it to config.ini in the same directory. You only need the keys for the services you use during your experiment. However, keep the structure of the config.ini file as it is to avoid errors. The toolkit will simply treat unset values as empty, but will throw an error when a key cannot be found at all.
Instructions on how to get the keys can be found here.
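The difference between an empty value and a missing key can be sketched with Python's configparser module. This is only an illustration, not GLUE's own code: the section and option names used here (speech, luis, key, region, app_id) are assumptions, and the real names are the ones in config.sample.ini.

```python
import configparser

# Hypothetical layout; the real section and option names are defined
# in config.sample.ini and may differ from the ones used here.
SAMPLE = """
[speech]
key = my-speech-key
region = westeurope

[luis]
key =
app_id =
"""


def read_key(parser, section, option):
    """Return the configured value, or None if it was left empty.

    Raises KeyError if the section or option is missing entirely,
    which mirrors the behaviour described above.
    """
    if not parser.has_option(section, option):
        raise KeyError(f"[{section}] {option} not found in config.ini")
    value = parser.get(section, option).strip()
    return value or None


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

speech_key = read_key(parser, "speech", "key")  # configured value
luis_key = read_key(parser, "luis", "key")      # present but empty -> None
```

Leaving a value empty is therefore harmless for a service you do not use, while deleting the line or its section would raise an error.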
This section describes the individual components of GLUE, which can either be run autonomously or, ideally, using the central orchestrator.
glue.py
- Central application orchestrator of the toolkit.
- Glues together the single modules in one place as needed.
- Reads input files and writes output files.
stt.py
- Batch-transcription of audio files using Microsoft Speech to Text API.
- Allows baseline models as well as custom endpoints.
- Functionality is limited to the languages and locales listed on the language support page.
tts.py
- Batch-synthesis of text strings using Microsoft Text to Speech API.
- Supports Speech Synthesis Markup Language (SSML) to fine-tune and customize the pronunciation, as described in the documentation.
- Retrieves a high-quality audio file from the API and converts it to the Microsoft speech format as well as to a version underlaid with the noise of a telephone line.
- Functionality is limited to the languages and fonts listed on the language support page.
- Make sure the voice of your choice is available in the respective Azure region (see documentation).
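To illustrate what an SSML request body can look like, the following minimal sketch builds one as a string. It is not taken from tts.py: the helper name build_ssml, the default voice en-US-JennyNeural and the prosody setting are assumptions chosen for illustration, so check the language support page for the voices available to you.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US", rate="medium"):
    """Wrap plain text in a minimal SSML document.

    The voice, language and speaking rate are illustrative defaults,
    not values prescribed by GLUE.
    """
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        # escape() protects characters such as & and < inside the utterance
        f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"
        "</voice></speak>"
    )


ssml = build_ssml("I would like to book a flight to Frankfurt.")
```

Escaping the utterance matters for batch input, because a single unescaped ampersand would make the whole request invalid XML.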
luis.py
- Batch-scoring of intent-text combinations using an existing LUIS model.
- See the following quickstart documentation in case you need some inspiration for your first LUIS-app.
- Configurable scoring threshold, in case predictions should only be accepted above a certain confidence score returned by the API.
- Writes scoring report as comma-separated file.
- Returns classification report and confusion matrix based on scikit-learn.
evaluate.py
- Evaluation of transcription results by comparing them with reference transcripts.
- Calculates metrics such as Word Error Rate (WER), Sentence Error Rate (SER), Word Recognition Rate (WRR).
- Implementation based on github.com/belambert/asr-evaluation.
- See some hints on how to improve your Custom Speech accuracy.
params.py
- Collects API and configuration parameters from the command line (ArgumentParser) and the config.ini.
helper.py
- Collection of helper functions which do not have a purpose on their own, but rather complement the orchestrator and keep the code neat and clean.
- In case there is a need for custom components, we recommend adding them to this module.
- Creates a folder for every run using the naming convention YYYYMMDD-[unique ID].
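The folder naming convention can be sketched as follows. This is an assumption-based illustration rather than the code in helper.py: the function name create_run_folder and the use of a shortened UUID as the unique ID are hypothetical.

```python
import uuid
from datetime import datetime
from pathlib import Path


def create_run_folder(base_dir):
    """Create and return a uniquely named output folder below base_dir."""
    date_part = datetime.now().strftime("%Y%m%d")
    # Shortened random ID; the actual ID format used by GLUE is an assumption.
    unique_part = uuid.uuid4().hex[:8]
    run_dir = Path(base_dir) / f"{date_part}-{unique_part}"
    # exist_ok=False guards against silently reusing an earlier run's folder.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


# Usage: run_dir = create_run_folder("output")
```

Prefixing the date keeps the run folders sorted chronologically, while the random part avoids collisions between runs on the same day.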
The following table shows and describes the available modes along with their input parameters as well as dependencies.
Mode | Command line parameter | Description | Dependencies |
---|---|---|---|
TTS | --do_synthesize | Activate text-to-speech synthesis. | Requires csv file with text column, see --input. |
STT | --do_transcribe | Activate speech-to-text processing. | Requires audio files, see --audio_files. |
STT | --audio_files | Path to folder with audio files. | Audio files have to be provided as WAV files with the parameters described here. |
STT-Evaluation | --do_evaluate | Activate evaluation of transcriptions based on reference transcriptions. | Requires csv file with text column and intent names. |
LUIS | --do_scoring | Activate LUIS model scoring. | Requires csv file with intent (ground truth of LUIS intent) and text columns (max. 500 characters due to a LUIS limitation; longer texts are cut). |
STT / TTS | --input | Path to comma-separated text input file. | |
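The flags in the table map naturally onto Python's argparse module. The sketch below is an assumption about how such a parser could be declared, not a copy of params.py. It also shows that argparse accepts unambiguous prefixes of long options by default, which is presumably why the example commands in this document can pass --audio instead of --audio_files.

```python
import argparse


def build_parser():
    """Declare the GLUE flags from the table above (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Sketch of the GLUE command line")
    parser.add_argument("--input", type=str, default=None,
                        help="Path to comma-separated text input file")
    parser.add_argument("--audio_files", type=str, default=None,
                        help="Path to folder with audio files")
    parser.add_argument("--do_synthesize", action="store_true",
                        help="Activate text-to-speech synthesis")
    parser.add_argument("--do_transcribe", action="store_true",
                        help="Activate speech-to-text processing")
    parser.add_argument("--do_evaluate", action="store_true",
                        help="Activate evaluation of transcriptions")
    parser.add_argument("--do_scoring", action="store_true",
                        help="Activate LUIS model scoring")
    return parser


# "--audio" is a unique prefix of "--audio_files", so argparse resolves it.
args = build_parser().parse_args(
    ["--audio", "assets/examples/input_files/audio", "--do_transcribe"]
)
```

Because every mode is a boolean switch, several modes can be combined in one call, and the orchestrator only needs to check which switches are set.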
Depending on your use case described in the table above, you may have to provide an input text file and/or audio files. In these cases, you have to pass the path to the respective input file or folder via the command line. The following guidelines apply to these input files:
- Comma-separated values (csv) file. If you only have an Excel sheet, you can export it as a csv file (Save as -> CSV UTF-8 (Comma delimited) (*.csv)).
- UTF-8 encoding (to make sure it has the correct encoding, open it with a text editor such as Notepad++ -> Encoding -> Convert to UTF-8).
- Column names, depending on the module, may be a mix of the following: text (the utterance, max. length of 500 characters) and/or intent (ground-truth LUIS intent) and/or audio (audio file name).
- We recommend putting the input file in a subfolder called input, but you can choose any location.
You can find example text files as well as example audio files by following the respective links.
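The guidelines above can be checked before a run with a few lines of Python. This is a hedged sketch and not part of GLUE: the function name check_input_csv is hypothetical, and the check only covers the encoding, the required columns and the 500-character limit.

```python
import csv
import io

MAX_TEXT_LENGTH = 500  # character limit for the text column stated above


def check_input_csv(raw_bytes, required_columns):
    """Return a list of human-readable problems found in a CSV input file."""
    try:
        # utf-8-sig also accepts the byte order mark that Excel writes
        # when exporting as "CSV UTF-8".
        content = raw_bytes.decode("utf-8-sig")
    except UnicodeDecodeError:
        return ["file is not UTF-8 encoded"]

    problems = []
    reader = csv.DictReader(io.StringIO(content))
    header = set(reader.fieldnames or [])
    missing = sorted(set(required_columns) - header)
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")

    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        text = row.get("text") or ""
        if len(text) > MAX_TEXT_LENGTH:
            problems.append(f"line {line_no}: text has {len(text)} characters")
    return problems


# Usage: check_input_csv(open("input/my_file.csv", "rb").read(), ["intent", "text"])
```

An empty list means the file passed these basic checks; it does not guarantee that the content itself is suitable for the selected mode.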
GLUE creates multiple folders and files of different types, depending on the modes you want it to run. The overview table below shows which folders and files it may create. Folders end with a /, files end with a file extension (e.g. .csv). An X in a mode column indicates that the file or folder is created in that mode.
File / Folder | STT | TTS | LUIS | Eval | Comment |
---|---|---|---|---|---|
luis_scoring.csv | | | X | | Comma-separated file with audio file names and transcriptions. |
stt_transcriptions.txt | X | | | | Tab-delimited file with audio file names and transcriptions. |
tts_transcriptions.txt | | X | | | Tab-delimited file with audio file names and transcriptions. |
tts_transcriptions.csv | | X | | | Comma-separated file with audio file names and transcriptions. |
transcriptions_full.csv | X | | | X | Comma-separated file with merged columns of the current run. |
input/ | | X | X | X | Folder with duplicate of input file. |
tts_converted/ | | X | | | Folder with TTS-results in Microsoft Speech-optimized format. |
tts_generated/ | | X | | | Folder with raw, high-quality TTS-results. |
tts_telephone/ | | X | | | Folder with TTS-results in Microsoft Speech-optimized format, underlaid with telephone line sound. |
The following section describes how to run the individual modules via the orchestrator.
This scenario describes how you can batch-transcribe audio files using GLUE. A potential use case can be that you do not have reference transcriptions for the audio files yet and want to accelerate the transcription process by "pre-labeling" the data. The recognitions might not be perfect, but they give you a useful starting point for manual correction.
- Azure Speech Service resource (see Get Your Keys).
- Audio files in .wav-format in a separate folder, as all wave files in the directory will be collected.
- See example audio files here.
- cd to the root folder of GLUE.
- Make sure your .venv is activated.
- Run the following command:
python src/glue.py --audio assets\examples\input_files\audio --do_transcribe
- Wait for the run to finish.
GLUE will create an output folder as below:
- case: [YYYYMMDD]-[UUID]/
| -- stt_transcriptions.txt (tab-delimited file with audio file names and transcriptions)
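For post-processing, the tab-delimited output can be loaded with Python's csv module. The sketch below assumes two columns (audio file name, then transcription) and no header row; that layout is an assumption based on the file description above, so adjust it to the file you actually get.

```python
import csv
import io


def read_transcriptions(handle):
    """Map audio file names to transcriptions from a tab-delimited file.

    Assumes two columns without a header row (an assumption, see above).
    """
    # QUOTE_NONE keeps quotation marks inside transcriptions untouched.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# Illustrative content only, not real GLUE output.
example = io.StringIO(
    "audio_1.wav\tI would like to book a flight to Frankfurt.\n"
    "audio_2.wav\tI want to cancel my journey to Kuala Lumpur.\n"
)
transcriptions = read_transcriptions(example)
```

For a real run, replace the in-memory example with the file opened in text mode and UTF-8 encoding.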
This scenario describes how you can batch-synthesize text data using GLUE. A potential use case can be that you want to create synthetic training data for your speech model, as you do not have enough speakers to create acoustic training material. The use case may be callcenter-related, which is why you need some tweaked data in order to simulate a realistic setup.
- Azure Speech Service resource (see Get Your Keys).
- Textual, comma-separated input file with a text column and utterances to be synthesized.
- See an example input file here.
- cd to the root folder of GLUE.
- Make sure your .venv is activated.
- Run the following command:
python src/glue.py --input assets\examples\input_files\example_tts.csv --do_synthesize
- Wait for the run to finish.
GLUE will create an output folder as below:
- case: [YYYYMMDD]-[UUID]/
| -- input/
| -- [input text file].csv
| -- tts_converted
| -- [converted audio files]
| -- [converted audio files]
| -- tts_generated
| -- [generated audio files]
| -- [generated audio files]
| -- tts_telephone
| -- [modified audio files]
| -- [modified audio files]
| -- tts_transcriptions.txt (tab-delimited file with audio file names and transcriptions)
| -- tts_transcriptions.csv (comma-separated file with audio file names and transcriptions)
This scenario shows how you can use GLUE to batch-score textual data on a LUIS-endpoint.
- LUIS app and the respective keys (see Get Your Keys).
- If you do not have a LUIS app yet, you can use our example LUIS app for flight bookings and import it to your resource.
- Textual input file with an intent AND a text column.
- See an example input file here.
- cd to the root folder of GLUE.
- Make sure your .venv is activated.
- Run the following command:
python src/glue.py --input assets\examples\input_files\example_luis.csv --do_scoring
- Wait for the run to finish and see the command line outputs.
GLUE will create an output folder as below:
- case: [YYYYMMDD]-[UUID]/
| -- input/
| -- [input text file].csv
| -- luis_scoring.csv (comma-separated file with audio file names and transcriptions)
In your command line, you will see a printed confusion matrix as well as a classification report. These reports cannot be written to a file; however, they are based on the output file luis_scoring.csv, and you may use the Jupyter notebook to load them again and gain deeper insights into the classification performance. The following table shows the structure of the scoring file luis_scoring.csv and gives an example of how predictions are handled, including when the confidence score is low.
intent | text | prediction_text | score_text | prediction_drop_text |
---|---|---|---|---|
Intent name based on reference Excel file | Raw text string | Predicted intent by LUIS app | Certainty score of LUIS model, between 0 and 1 | Predicted intent by LUIS app, None-intent in case of dropped value (when the score is below the confidence threshold, e.g. 0.82) |
BookFlight | I would like to book a flight to Frankfurt. | BookFlight | 0.9450068 | BookFlight |
CancelFlight | I want to cancel my journey to Kuala Lumpur. | CancelFlight | 0.8340548 | CancelFlight |
ChangeFlight | I would like to change my flight to Singapore. | ChangeFlight | 0.9112311 | ChangeFlight |
ChangeFlight | I would like to book a seat on my flight to Stuttgart. | BookFlight | 0.5517158 | None |
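The drop logic behind the last column can be sketched in a few lines, using the four example rows from the table and the example threshold of 0.82. This is an illustration, not the code in luis.py; in particular, whether a score exactly equal to the threshold is kept is an assumption.

```python
def apply_threshold(prediction, score, threshold=0.82):
    """Return the prediction, or "None" if the score is below the threshold."""
    # Treating score == threshold as accepted is an assumption.
    return prediction if score >= threshold else "None"


# (ground-truth intent, predicted intent, score) taken from the table above
rows = [
    ("BookFlight", "BookFlight", 0.9450068),
    ("CancelFlight", "CancelFlight", 0.8340548),
    ("ChangeFlight", "ChangeFlight", 0.9112311),
    ("ChangeFlight", "BookFlight", 0.5517158),
]

dropped = [apply_threshold(pred, score) for _, pred, score in rows]

# Share of rows where the thresholded prediction matches the ground truth
accuracy = sum(truth == pred for (truth, _, _), pred in zip(rows, dropped)) / len(rows)
```

In the last row the low score turns a wrong BookFlight prediction into the None-intent, which is usually easier to handle downstream than a confident-looking misclassification.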
This scenario describes how you can compare already existing recognitions with a ground-truth reference transcription using GLUE. A potential use case can be that you want to assess the quality of your speech model and figure out potential recognition problems, which you may counteract by custom model training. In this case, you have to provide already existing recognitions to the tool.
- Azure Speech Service resource (see Get Your Keys).
- Textual input file with a text column with reference transcriptions as well as a rec column with recognitions.
- See an example input file here.
- cd to the root folder of GLUE.
- Make sure your .venv is activated.
- Run the following command:
python src/glue.py --input assets\examples\output_files\example_transcriptions_full.csv --do_evaluate
- Wait for the run to finish and see the command line outputs.
GLUE will create an output folder as below:
- case: [YYYYMMDD]-[UUID]/
| -- input/
| -- [input text file].csv
| -- transcriptions_full.csv (comma-separated file with merged columns of the current run)
In your command line, you will see the output of the evaluation algorithms per sentence as well as an overall summary in case you used multiple utterances. These evaluations are based on the text and rec columns. As the command line output is only temporary, we recommend using the Jupyter Notebook resulting from this run.
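To make the metrics more tangible, the following sketch computes Word Error Rate and Sentence Error Rate from pairs of reference text and recognition using a word-level edit distance. It is a simplified stand-in and not the implementation used by evaluate.py: there is no text normalization (case, punctuation), and WRR is left out because its exact definition differs between tools.

```python
def word_errors(reference, hypothesis):
    """Return the word-level edit distance between two sentences."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def evaluate(pairs):
    """Compute WER and SER over (reference, recognition) pairs."""
    errors = [word_errors(ref, rec) for ref, rec in pairs]
    total_words = sum(len(ref.split()) for ref, _ in pairs)
    return {
        "WER": sum(errors) / total_words,
        "SER": sum(1 for e in errors if e > 0) / len(pairs),
    }


metrics = evaluate([
    ("I would like to book a flight to Frankfurt.",
     "I would like to book a fight to Frankfurt."),
    ("I want to cancel my journey", "I want to cancel my journey"),
])
```

Here one substituted word out of 15 reference words gives a WER of about 0.067, while one of two sentences containing an error gives an SER of 0.5.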
This scenario describes how you can batch-transcribe audio files and compare these recognitions with a ground-truth reference transcription using GLUE. A potential use case can be that you want to assess the quality of your speech model and figure out potential recognition problems, which you may counteract by custom model training.
- Azure Speech Service resource (see Get Your Keys).
- Audio files in .wav-format in a dedicated folder, as all wave files in the directory will be collected.
- Textual input file with an audio column for reference audio file names AND the respective text column with reference transcriptions.
- See an example input file here and example audio files here.
- cd to the root folder of GLUE.
- Make sure your .venv is activated.
- Run the following command:
python src/glue.py --audio assets\examples\input_files\audio --input assets\examples\input_files\example_stt_eval.csv --do_transcribe --do_evaluate
- Wait for the run to finish and see the command line outputs.
GLUE will create an output folder as below:
- case: [YYYYMMDD]-[UUID]/
| -- input/
| -- [input text file].csv
| -- stt_transcriptions.txt (tab-delimited file with audio file names and transcriptions)
| -- transcriptions_full.csv (comma-separated file with merged columns of the current run)
This scenario describes how you can batch-transcribe audio files, compare these recognitions with a ground-truth reference transcription and score both versions on a LUIS endpoint using GLUE. A potential use case can be that you want to assess the quality of your speech model, figure out potential recognition problems and also compare the impact on a LUIS model when STT is used in between.
- Azure Speech Service resource (see Get Your Keys).
- LUIS app and the respective keys (see Get Your Keys).
- If you do not have a LUIS app yet, you can use our example LUIS app for flight bookings and import it to your resource.
- Audio files in .wav-format in a dedicated folder, as all wave files in the directory will be collected.
- Textual input file with an audio column for reference audio file names AND an intent column for the LUIS class AND the respective text column with reference transcriptions.
- See an example input file here and example audio files here.
- cd to the root folder of GLUE.
- Make sure your .venv is activated.
- Run the following command:
python src/glue.py --audio assets\examples\input_files\audio --input assets\examples\input_files\example_stt_eval_luis.csv --do_transcribe --do_evaluate --do_scoring
- Wait for the run to finish and see the command line outputs.
GLUE will create an output folder as below:
- case: [YYYYMMDD]-[UUID]/
| -- input/
| -- [input text file].csv
| -- luis_scoring.csv (comma-separated file with audio file names and transcriptions)
| -- stt_transcriptions.txt (tab-delimited file with audio file names and transcriptions)
| -- transcriptions_full.csv (comma-separated file with merged columns of the current run)
In your command line, you will see a printed confusion matrix as well as a classification report. These reports cannot be written to a file; however, they are based on the output file luis_scoring.csv, and you may use the Jupyter notebook to load them again and gain deeper insights into the classification performance. The following table shows the structure of the scoring file luis_scoring.csv and gives an example of how predictions are handled, including when the confidence score is low.
Compared to scenario 3, where only the reference text was scored, the luis_scoring.csv file will have more columns this time. Each of the columns prediction_, score_ and prediction_drop_ will have a text (reference transcript) and a rec (recognition) version. This is meant to help you evaluate the differences in LUIS recognition given different inputs. Intent predictions and scores may differ due to irregularities in transcribing audio files. We invented some transcription issues, which you can see below in the rec column.
intent | text | rec | prediction_{text/rec} | score_{text/rec} | prediction_drop_{text/rec} |
---|---|---|---|---|---|
Intent name based on reference Excel file | Raw text string | STT recognition | Predicted intent by LUIS app | Certainty score of LUIS model, between 0 and 1 | Predicted intent by LUIS app, None-intent in case of dropped value (when the score is below the confidence threshold, e.g. 0.82) |
BookFlight | I would like to book a flight to Frankfurt. | I would like to book a fight to Frankfurt. | BookFlight | 0.9450068 | BookFlight |
CancelFlight | I want to cancel my journey to Kuala Lumpur. | I wanna cancel my journey to kualalumpur. | CancelFlight | 0.8340548 | CancelFlight |
ChangeFlight | I would like to change my flight to Singapore. | Would like to change mum flight to Singapore. | ChangeFlight | 0.9112311 | ChangeFlight |
ChangeFlight | I would like to book a seat on my flight to Stuttgart. | I would like to book a suite on my flight to Stuttgart. | BookFlight | 0.5517158 | None |
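Once both versions are scored, the text and rec columns can be compared programmatically. The sketch below assumes the column names prediction_text, prediction_rec, score_text and score_rec as described above; the rows are made up for illustration and are not real GLUE output.

```python
import csv
import io

# Illustrative data only; the values are invented for this sketch.
EXAMPLE = """intent,text,rec,prediction_text,score_text,prediction_rec,score_rec
BookFlight,I would like to book a flight.,I would like to book a fight.,BookFlight,0.94,BookFlight,0.87
ChangeFlight,I would like to change my seat.,I would like to change my suite.,ChangeFlight,0.91,BookFlight,0.55
"""


def compare_predictions(handle):
    """Return (recognitions whose intent changed, mean score difference)."""
    changed, deltas = [], []
    for row in csv.DictReader(handle):
        # Negative delta: the recognition scored lower than the reference text.
        deltas.append(float(row["score_rec"]) - float(row["score_text"]))
        if row["prediction_text"] != row["prediction_rec"]:
            changed.append(row["rec"])
    mean_delta = sum(deltas) / len(deltas) if deltas else 0.0
    return changed, mean_delta


changed, mean_delta = compare_predictions(io.StringIO(EXAMPLE))
```

Rows listed in changed are the ones where a transcription error flipped the predicted intent, which makes them good candidates for custom speech model training data.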
This scenario describes how you can batch-transcribe audio files and score the recognitions on a LUIS endpoint using GLUE. A potential use case can be that you want to assess the quality of your LUIS model on STT recognitions in case you do not have a reference transcription. However, you need an entry in the intent column for every input audio file.
- Azure Speech Service resource (see Get Your Keys).
- LUIS app and the respective keys (see Get Your Keys).
- If you do not have a LUIS app yet, you can use our example LUIS app for flight bookings and import it to your resource.
- Audio files in .wav-format in a dedicated folder, as all wave files in the directory will be collected.
- Textual input file with an audio column for reference audio file names AND an intent column for the LUIS class.
- See an example input file here.
- cd to the root folder of GLUE.
- Make sure your .venv is activated.
- Run the following command:
python src/glue.py --audio assets\examples\input_files\audio --input assets\examples\input_files\example_luis.csv --do_transcribe --do_scoring
- Wait for the run to finish and see the command line outputs.
GLUE will create an output folder as below:
- case: [YYYYMMDD]-[UUID]/
| -- input/
| -- [input text file].csv
| -- luis_scoring.csv (comma-separated file with audio file names and transcriptions)
| -- stt_transcriptions.txt (tab-delimited file with audio file names and transcriptions)
| -- transcriptions_full.csv (comma-separated file with merged columns of the current run)
In your command line, you will see a printed confusion matrix as well as a classification report. These reports cannot be written to a file; however, they are based on the output file luis_scoring.csv, and you may use the Jupyter notebook to load them again and gain deeper insights into the classification performance. The following table shows the structure of the scoring file luis_scoring.csv and gives an example of how predictions are handled, including when the confidence score is low.
Compared to scenario 3, where only the reference utterances were scored, the columns prediction_, score_ and prediction_drop_ of the luis_scoring.csv file will end with rec (recognition) instead of text (reference transcript) in this case.
intent | rec | prediction_rec | score_rec | prediction_drop_rec |
---|---|---|---|---|
Intent name based on reference Excel file | STT result | Predicted intent by LUIS app | Certainty score of LUIS model, between 0 and 1 | Predicted intent by LUIS app, None-intent in case of dropped value (when the score is below the confidence threshold, e.g. 0.82) |
BookFlight | I would like to book a fight to Frankfurt. | BookFlight | 0.8750068 | BookFlight |
CancelFlight | I want to cancel my journey to kualalumpur. | CancelFlight | 0.7140548 | None |
ChangeFlight | Would like to change my flight to Singapore. | ChangeFlight | 0.8992311 | ChangeFlight |
ChangeFlight | I would like to book a suite on my flight to Stuttgart. | BookFlight | 0.3917158 | None |
We can see that the recognition scores have decreased, so this may have an impact on the overall recognition rate.
This toolkit is the right starting point for your bring-your-own data use cases. However, it does not provide automated training runs and does not ensure an improvement of the performance on your task. It helps you to do end-to-end testing and gain the right insights on how to improve the quality on your use-case.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.