G2: Voice Controlled Thumper
https://github.com/YahiaK98/Embedded_Project2
The three members participating in the development of this project are:
- Ahmed Leithy 900160088
- Yahia Khaled 900161331
- Yousef Koura 900160083
Our idea for the project is to develop a voice controlled thumper that would be able to listen to and recognize the following four simple commands:
- Go
- Stop
- Right
- Left
to which it would react accordingly. The system uses a microphone to capture the input sound, which is converted to a digital signal (using an ADC) and, after preprocessing, becomes the input of a machine learning model (MicroSpeech) running on the STM Nucleo board that is designed to recognize the commands. Based on the command recognized, the microcontroller sends UART commands to the TReX to move the thumper accordingly (in a similar fashion to lab 4). The following diagrams illustrate this process:
- µVision IDE - Keil (For the testing of the Sound Detector)
- STM32 Cube IDE
- Microphone (Sound Detector we used)
- STM32L432KC
- Dagu Thumper
- Pololu TReX
- Microspeech Github Repository
- Sin Wave Tutorial with Explanation
- Wiki For The Sound Detector Used
- Digilent PmodMIC Reference Manual
Programming the thumper was the most straightforward part of the development process, as we were able to reuse the code we had written for the corresponding lab experiment. After recreating the connections we had made for the experiment, we modularized the code so that we would have one function for each of the commands (e.g., a function that turns left). We had previously performed the calculations for these commands in the lab, but we needed to fine-tune them empirically so that they would work well on our specific thumper.
The code providing the basic functionality of moving the thumper can be found here.
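As an illustration of the modularization, below is a minimal sketch of two such command functions. The command bytes are placeholders only; the actual packets come from the Pololu TReX serial command documentation and our lab 4 code:

```cpp
#include "stm32l4xx_hal.h"

extern UART_HandleTypeDef huart1;  // UART channel wired to the TReX (assumed handle name)

// Send a raw command packet to the TReX over UART.
static void TReX_Send(const uint8_t *cmd, uint16_t len)
{
    HAL_UART_Transmit(&huart1, (uint8_t *)cmd, len, HAL_MAX_DELAY);
}

// Stop both motors. The bytes below are placeholders, not the real TReX opcodes.
void Stop(void)
{
    const uint8_t stop_cmd[] = {0xC0, 0x00, 0xC8, 0x00};  // placeholder packet
    TReX_Send(stop_cmd, sizeof(stop_cmd));
}

// Turn left by driving the motors in opposite directions for an
// empirically tuned duration, then stopping.
void TurnLeft(void)
{
    const uint8_t left_cmd[] = {0xC2, 0x40, 0xC9, 0x40};  // placeholder packet
    TReX_Send(left_cmd, sizeof(left_cmd));
    HAL_Delay(650);  // turn duration tuned on our specific thumper
    Stop();
}
```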
We needed to test a microphone's ability to produce digital samples in order to integrate it into the rest of the system. These digital signals would later be used to produce spectrograms, which are the input to our MicroSpeech model. To that end, we tested three different microphone modules and ended up using the last one, as it worked the most intuitively and had the best documentation. We tested the microphones by connecting them to an STM Nucleo board (through the on-board ADC if necessary) and displaying the value of the signal on Tera Term by sending it over UART. This meant that a low sampling rate was needed in order to read the output, and simple polling was sufficient for this phase of testing (about 1 kHz). This is, of course, nowhere near the sampling rate needed to reconstruct the sound signal, as a minimum of 8 kHz is usually used and the MicroSpeech sample program itself uses a 16 kHz sampling rate; but it was enough to check whether the microphones were working as expected. Perhaps it would have been more intuitive, as suggested by Dr. Shalan, to form a waveform out of the samples for examination, either by using the tools provided by the µVision IDE or by saving the values to an Excel document and producing a graph from them.
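The polling loop for this phase looked roughly like the sketch below, assuming an ADC handle hadc1 wired to the microphone and a UART handle huart2 routed to Tera Term (names and configuration are illustrative):

```cpp
#include <stdio.h>
#include "stm32l4xx_hal.h"

extern ADC_HandleTypeDef hadc1;    // ADC channel connected to the microphone output (assumed name)
extern UART_HandleTypeDef huart2;  // UART routed to the ST-Link VCP / Tera Term (assumed name)

void SampleAndPrintLoop(void)
{
    char msg[32];
    while (1)
    {
        HAL_ADC_Start(&hadc1);
        if (HAL_ADC_PollForConversion(&hadc1, HAL_MAX_DELAY) == HAL_OK)
        {
            uint32_t raw = HAL_ADC_GetValue(&hadc1);  // 12-bit sample (0-4095)
            int len = snprintf(msg, sizeof(msg), "%lu\r\n", (unsigned long)raw);
            HAL_UART_Transmit(&huart2, (uint8_t *)msg, len, HAL_MAX_DELAY);
        }
        HAL_Delay(1);  // ~1 kHz polling, enough to eyeball the values on Tera Term
    }
}
```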
As previously mentioned, the Sound Detector (the last microphone) was the most appropriate option given the documentation and time constraints, but we detail each of our attempts below, along with the issues we faced when using each module.
The first type of microphone we attempted to use is the electret microphone, which can be easily obtained from any electronics store for a cheap price. For reference, this is what the microphone looks like:
When trying to use this microphone, we faced several problems which are outlined below.
- Capacitor and resistor values: since this is a simple microphone that you interface with directly through its two pins, we had to choose appropriate resistor and capacitor values to build a properly functioning circuit. Deciding on these values was tedious, as we referred to several sources and experimented with different values; those sources included circuits found on the internet and advice from the salesman at the electronics shop where we bought the microphone. None of the combinations provided reasonable results.
- The need for an amplifier: after connecting the circuit and outputting the produced values on Tera Term through UART, we noticed that the value barely changed when we spoke into the microphone. This was most likely because the signal needed conditioning to amplify the output.
As can be seen, using this microphone adds significant overhead that can easily be avoided. An audio sensing board already handles all the aforementioned points, and sometimes provides extra functionality such as an on-board ADC. Therefore, we do not recommend using this microphone if a similar project is attempted in the future.
Our second attempt involved using the microphone provided by the CSCE workshop at AUC: the Digilent Pmod MIC. For reference, the exact version of the microphone that we used is shown below.
This board provided a lot of helpful features that the previous mic lacked, as it has an on-board ADC and a dynamic range compressor. However, after interfacing it with the microcontroller via the SPI protocol and observing the output, we noticed it was very noisy. There also seemed to be a variable DC offset in the signal that was not being removed. On top of that, when we inspected the microphone's datasheet online to try to understand it better, we found it to be extremely brief and not very informative. This made debugging hard, as we had nothing to refer back to and no leads to follow other than the datasheet for its on-board ADC. We therefore started browsing the internet in hopes of finding a better documented, better performing microphone. In our final experiment, the board was producing around 0.75 V without any input sound, which is problematic since we expect the output to hover around the middle of the ADC's range (1.65 V). Its sensitivity to sound was also weak, meaning it would not provide high-resolution audio.
For more information about this microphone, please visit this link.
Our final attempt, and the microphone we ended up going with, is this one. Although it was considerably more expensive than the electret mics and has no on-board ADC, we found it to perform much better and to be easier to use. It has an analog output as well as a digital signal indicating whether sound is present in the environment. After connecting it to the microcontroller's ADC, we were able to observe the output. The values on Tera Term fluctuated around the 1.6 V mark, which makes sense since we powered it from 3.3 V, so the output idles near the middle of the ADC's range. When speaking into the mic, the values oscillated like a sine wave around that midpoint. With reference to the HAL documentation, we were able to read the ADC's values at a rate of 16 kHz, which provides sufficient quality for an audio signal and happens to be the sampling rate used by the spectrograms utilized for training our model.
For our final setup, we used the digital noise signal as a trigger to start the sampling process. We would sample the external audio for 1 second (matching the training .wav files) at a rate of 16 kHz, thus producing 16,000 samples. These samples would be the input to the MicroSpeech model after being transformed into a spectrogram by the MicroSpeech code. Unfortunately, we faced many issues with the step of transforming the audio samples into a spectrogram; although the examples provide code for this process, it was not producing correct output, which heavily affected the model's predictions.
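Conceptually, the capture step looked like the sketch below: wait for the Sound Detector's digital output to go high, then collect 16,000 ADC samples paced at 16 kHz by a timer. Handle names, the trigger pin, and the timer-flag pacing are assumptions, not our exact code:

```cpp
#include "stm32l4xx_hal.h"

#define NUM_SAMPLES 16000  // 1 second of audio at 16 kHz

extern ADC_HandleTypeDef hadc1;  // microphone analog output (assumed handle name)
extern TIM_HandleTypeDef htim2;  // timer configured to update at 16 kHz (assumed)

static int16_t audio_buffer[NUM_SAMPLES];

// Block until the Sound Detector's digital "noise" pin goes high,
// then capture one second of audio at 16 kHz.
void CaptureOneSecond(void)
{
    // PA8 is an assumed wiring for the detector's digital output.
    while (HAL_GPIO_ReadPin(GPIOA, GPIO_PIN_8) == GPIO_PIN_RESET) { }

    HAL_TIM_Base_Start(&htim2);
    for (uint32_t i = 0; i < NUM_SAMPLES; i++)
    {
        // Wait for the next 16 kHz timer tick.
        while (__HAL_TIM_GET_FLAG(&htim2, TIM_FLAG_UPDATE) == RESET) { }
        __HAL_TIM_CLEAR_FLAG(&htim2, TIM_FLAG_UPDATE);

        HAL_ADC_Start(&hadc1);
        HAL_ADC_PollForConversion(&hadc1, 1);
        uint32_t raw = HAL_ADC_GetValue(&hadc1);
        // Center the 12-bit sample around zero, since the model expects signed PCM.
        audio_buffer[i] = (int16_t)raw - 2048;
    }
    HAL_TIM_Base_Stop(&htim2);
}
```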
To check whether this was an issue with the mic, we performed the following three checks:
- We used the sample audio input provided with the MicroSpeech example (without our mic's involvement), relying on a single array of samples for each of the "yes" and "no" commands. The correctness of these arrays was double-checked by turning them back into .wav files, which produced the sound of a man speaking the command. However, this still did NOT produce the correct spectrograms, leading us to believe that there is an issue with that section of the code.
- We displayed the output of the mic on a graph by recording the inputs to an Excel file and then graphing them. This produced a reasonable graph with waves around the beginning of the samples (where we spoke) and a section hovering around the middle value of the y-axis where there was silence. We were not able to spot any abnormalities.
- As will be discussed shortly, we used spectrograms from training (produced by the Python code during training) on the microcontroller, which produced perfect predictions as expected. This led us to believe that the issue lies in the code that converts sound samples to spectrograms.
We first wanted to test running a simple model on the microcontroller, to gain experience with the process of deploying a model on the STM32L432KC board and producing output. To that end, and with reference to this tutorial, we produced a sin wave model capable of predicting the value of sin(x), given the variable x as an input parameter. The outputs are shown in the following screenshots (sent over UART):
The input of the model is stated at the top of each screenshot; the outputs (repeated by row) are the predictions of the model, and the number after "Duration" is the time taken by the model to compute the prediction (in μs). While the tutorial does a good job of explaining most of the steps required to produce this output, note that the function used in the tutorial to add a fully connected layer to the model is deprecated. As such, replace the following call:
tflite_status = micro_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_FULLY_CONNECTED,
    tflite::ops::micro::Register_FULLY_CONNECTED());
with the following single line:
tflite_status = micro_op_resolver.AddFullyConnected();
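For context, in recent TFLite Micro releases the resolver itself is declared with a template parameter giving the number of registered ops, so the surrounding setup looks roughly like the sketch below (not the tutorial's exact listing):

```cpp
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Only the fully connected kernel is needed for the sine model; the template
// parameter is the number of ops we intend to register.
static tflite::MicroMutableOpResolver<1> micro_op_resolver;

void SetupOpResolver(void)
{
    TfLiteStatus tflite_status = micro_op_resolver.AddFullyConnected();
    if (tflite_status != kTfLiteOk)
    {
        // Registration failed; report the error (e.g., over UART) and halt.
    }
}
```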
It is also recommended to use Linux when building the project, as most of the build commands assume a UNIX-like system.
Now, with more experience using TFLite models on microcontrollers, we started looking for a machine learning model to use for speech recognition. This model would need to recognize specific voice commands that would then be mapped to the modular control functions developed for moving the thumper. We found a model developed by TensorFlow, called MicroSpeech, that is trained to recognize the words "yes" and "no". Luckily, this model is trained on Google's speech commands dataset; consequently, it can be retrained to recognize any words found in that dataset.
We found a Google Colab notebook that allows us to manually train the model to fit our needs. Notably, we set the "WANTED_WORDS" parameter to "go,stop,right,left" so that the model recognizes commands representative of the vehicle's movement. After training the model, we had to freeze it, convert it to a .tflite model file, and then convert that into a C buffer; this C buffer is then loaded and used for inference on the microcontroller. These conversions can all be accomplished through functions available inside the notebook. The notebook also contains cells for testing the model against test data from the Google speech commands dataset; after testing, we found that our model had an accuracy of ~88% for 6-class classification (namely: silence, unknown, go, stop, right, left).
When it came to actually integrating our newly trained model onto the microcontroller, we found that we had three approaches to choose from (based on further research and suggestions from the instructor and colleagues); namely, these were:
- Writing the model by hand
- Using X-Cube-AI interface
- Using TFLite's example project
The first approach would be to manually do what the writer of the sin model tutorial discussed earlier did. This approach is not recommended by the TensorFlow Lite for Microcontrollers documentation and would have required defining the model from scratch by hand; due to the inefficiency of this approach, we elected to skip it. That led us to the next approach, using X-Cube-AI.
X-Cube-AI is a package that can be downloaded and installed in STM32CubeMX. It allows loading .tflite or .h5 model files and automatically integrates them into the project files when code is generated by STM32CubeMX. The package can be activated and configured like any peripheral in CubeMX through the UI, and once it is activated, a model can be loaded directly from the file browser. A big advantage of X-Cube-AI is its model analysis functionality, which reports information such as how much SRAM and flash memory the model requires; this is very helpful in applications where we have to deal with the limited memory of a microcontroller. After analyzing our model, we found that it would take ~24 KB of flash and ~6 KB of SRAM, which fits easily in our STM32L432KC board; this is essential, as there needs to be enough space for all the data processing functionality that also has to be on board. Once the model is loaded, we can generate the project. Another great advantage of X-Cube-AI is that it automatically generates and abstracts all the model-related code, leaving only the inputs and outputs of the model to the developer. The developer provides inputs to the model in an input tensor and extracts results from an output tensor; the format of these tensors is detailed in the X-Cube-AI documentation.
To get familiar with this new package, we decided to redo the sin model project using it (as it is a simple model with simple inputs and outputs). We learned how to utilize the input tensors from this reference and were able to input float values to the model in this fashion:
for (uint32_t i = 0; i < AI_SINE_MODEL_IN_1_SIZE; i++)
{
    ((ai_float *)in_data)[i] = (ai_float)5.0f;  // input the value of 5 into the model
}
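Reading the result back mirrors the same pattern through the output buffer. A rough sketch is shown below; ai_sine_model_run, AI_SINE_MODEL_OUT_1_SIZE, and the sine_model / out_data variables follow the naming scheme the code generator uses for a model named "sine_model" and should be treated as assumptions:

```cpp
// Run inference with the X-Cube-AI generated API and read the output buffer.
// sine_model, ai_input, ai_output, and out_data are set up by the generated
// bootstrap code; the names here are assumptions based on the model's name.
ai_i32 n_batches = ai_sine_model_run(sine_model, &ai_input[0], &ai_output[0]);
if (n_batches != 1)
{
    // Inference failed; ai_sine_model_get_error(sine_model) gives the reason.
}

for (uint32_t i = 0; i < AI_SINE_MODEL_OUT_1_SIZE; i++)
{
    float prediction = ((ai_float *)out_data)[i];  // estimated sin(x)
    // ... send the prediction over UART for inspection
}
```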
After testing the model with multiple values, we found that it produced rough but correct estimates of the sin value at those points (matching the results previously described when the model was first deployed). We then tried loading the MicroSpeech model; here, we faced the problem that the model's input is a 49x40 spectrogram. Due to the abstracted nature of the X-Cube-AI generated project, integrating this more complex input proved very complicated and was taking too much time; furthermore, the process for extracting multiple outputs from the model (since it produces a score for each class) was not described clearly and thus proved difficult as well. Consequently, we decided to abandon the package, as it was costing more time than it was saving us. While we did not end up using it for this application, we would recommend it for applications with simple inputs, as it simplifies the process greatly. Additionally, its model analysis functionality is useful on its own, since knowing the memory requirements of the model is essential in such projects; therefore, we would recommend installing it and using it for model analysis even if the automatic integration functionality is not needed. Some screenshots showing the process of adding the package and using the analysis feature are shown below.
Finally, we decided to follow the recommended way of utilizing the model: using the example projects provided by the TensorFlow Lite for Microcontrollers repo. This is usually done by using the provided makefile to generate the project directory and then manually editing, compiling, and flashing the code onto the microcontroller (using Mbed, for example). Since we were more used to the process of using STM32CubeMX and Keil µVision to write the code and flash it onto the board, we started searching for a way to use that workflow for our application.
Our research led us to an IDE called STM32CubeIDE, which provides a full IDE (based on the Eclipse environment) with STM32CubeMX integrated directly within it. The advantage of STM32CubeIDE is that it allows us to set up the whole project directory in a simpler fashion than Keil µVision, and its native CubeMX integration allows faster edits to the project configuration. We cloned the TensorFlow repo and started by adding the MicroSpeech example project directory into our STM32CubeIDE project. Here, we faced a big problem: STM32CubeIDE uses virtual folders, so we could not just copy the whole directory into the IDE; furthermore, any new folders made inside the IDE needed to be manually added to the include paths in the project preferences. By default, the IDE uses two folders, inc and src, to hold headers and .c/.cc files respectively; we therefore decided to copy all needed headers into the inc folder and all needed .c/.cc files into the src folder. This caused another problem: all the includes in the files used the full path of each file, which meant we had to go into every file and correct the path to match the new directory structure. Once we had added all the MicroSpeech files and fixed their includes, we found that they had many dependencies on other files in the TensorFlow repo that were not inside the MicroSpeech directory. Consequently, we had to collect those files from the different directories and integrate them into the IDE in the same way. After slogging through this grueling process, we finally had the full MicroSpeech example project integrated into our STM32CubeIDE project! The whole project can be found in our GitHub repo linked at the top of this page.
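To give a concrete idea of what fixing the includes involved, here is one representative change (the original path reflects the repo's directory layout at the time):

```cpp
// Before (as written inside the TensorFlow repo):
#include "tensorflow/lite/micro/examples/micro_speech/micro_features/micro_model_settings.h"

// After (everything flattened into the project's inc folder):
#include "micro_model_settings.h"
```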
With everything in place, we now needed to add the model we trained to this project. We had already converted the model into a C buffer, so integrating it into our project was very simple. All we had to do was replace the provided model.cc file with our own version of it; additionally, we had to modify the micro_model_settings header and source file to have the correct class labels and number of classes. The next step was to set up the model and use it to run inference; to that end, we adapted the main_functions.cc file found in the MicroSpeech example directory. This file provides the needed setup functions for the model and implements the different components needed to process the input and interpret the output. A simplified version of the pipeline is shown below:
One thing to note is that "micro features" are essentially the spectrograms in this context. It should also be noted that the audio provider shown above is supposed to be replaced with an implementation that takes input from a mic; however, we wanted to ensure the pipeline was functional before introducing the mic into the fray as another potential source of errors. To that end, we initially used the sample input buffer provided by the repo, which contains 16,000 samples of a "yes" audio file (a 1000 ms audio clip sampled at 16 kHz). To use these samples, we utilized a mock audio provider file that got its samples from the aforementioned input buffer in place of the mic audio provider in the pipeline. As mentioned at the end of the "using the microphone" section, we encountered a lot of trouble getting these audio samples to work inside our pipeline; additionally, we went through multiple stages of verifying the correctness of the audio samples (whether the ones provided by the example or the ones acquired through the mic), also as described in that section. Since we were now quite certain that the issue was not in the audio samples themselves, we turned our focus to the spectrograms.
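For reference, the mock audio provider is conceptually very simple: it hands back slices of the canned sample buffer instead of live microphone data. Below is a rough sketch, assuming the audio_provider.h interface and the micro_model_settings constants from the TFLite Micro version we used (signatures and the exact buffer names may differ in other releases):

```cpp
// Minimal mock audio provider: serves slices of a canned 1-second "yes" buffer
// instead of live microphone samples.
#include "audio_provider.h"
#include "micro_model_settings.h"        // kMaxAudioSampleSize, kAudioSampleFrequency
#include "yes_1000ms_sample_data.h"      // assumed header exposing the 16,000-sample buffer

namespace {
int16_t g_window[kMaxAudioSampleSize];   // scratch buffer handed back to the caller
}

TfLiteStatus GetAudioSamples(tflite::ErrorReporter* error_reporter,
                             int start_ms, int duration_ms,
                             int* audio_samples_size, int16_t** audio_samples) {
  const int start_sample = (start_ms * kAudioSampleFrequency) / 1000;
  const int sample_count = (duration_ms * kAudioSampleFrequency) / 1000;
  for (int i = 0; i < sample_count; ++i) {
    const int index = start_sample + i;
    // Copy from the canned buffer; pad with silence once we run past its end.
    g_window[i] = (index < g_yes_1000ms_sample_data_size)
                      ? g_yes_1000ms_sample_data[index]
                      : 0;
  }
  *audio_samples_size = sample_count;
  *audio_samples = g_window;
  return kTfLiteOk;
}
```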
To test the spectrograms, we were able to find "golden standard" spectrograms for the words "yes" and "no"; these spectrograms were saved as C buffers ready to be used directly in code. Next, we trained a model on those words and swapped it into our project. We then tested the model by feeding the spectrograms directly into the model's input (essentially bypassing the first three stages of the pipeline) and printed the scores for each class on a UART terminal (i.e., we took the output from step 6 and stopped the pipeline there). We checked the output for both the "yes" and "no" spectrograms and found that the correct class received a very high score each time (note that the threshold for recognizing a command is a score >= 200). We then tested our previous model (trained to recognize go, stop, right, and left). We found that the "yes" input produced a "left" guess with a somewhat high score (which makes sense, since the two words are somewhat similar); on the other hand, a "no" input produced a "go" guess with a very high score (which makes sense, since the two words are very close phonetically). All of these results are attached in an image below. With this, we now knew that the problem lay in producing a correct spectrogram from the audio input. We spent many days trying to get that to work; however, the conversion code is implemented by the TFLite repo and is not documented well enough. Moreover, the code is very tightly coupled and pulls in classes from many different libraries, which made it practically impossible to locate the exact issue.
To that end, we decided to cut out the first three stages of the pipeline and use pre-loaded spectrograms of the "go", "stop", "left", and "right" commands. To acquire these spectrograms, we went into the training Google Colab notebook mentioned before and extracted them from the cell used for testing; essentially, we interrupt the testing process at a certain step and save the current test data (which is a correctly formatted spectrogram) into a C buffer. The edited cell is shown below:
# Helper function to run inference
saved_test = []
def run_tflite_inference(tflite_model_path, model_type="Float"):
  # Load test data
  np.random.seed(0)  # set random seed for reproducible test results.
  with tf.Session() as sess:
    test_data, test_labels = audio_processor.get_data(
        -1, 0, model_settings, BACKGROUND_FREQUENCY, BACKGROUND_VOLUME_RANGE,
        TIME_SHIFT_MS, 'testing', sess)
  test_data = np.expand_dims(test_data, axis=1).astype(np.float32)

  # Initialize the interpreter
  interpreter = tf.lite.Interpreter(tflite_model_path)
  interpreter.allocate_tensors()
  input_details = interpreter.get_input_details()[0]
  output_details = interpreter.get_output_details()[0]

  # For quantized models, manually quantize the input data from float to integer
  if model_type == "Quantized":
    input_scale, input_zero_point = input_details["quantization"]
    test_data = test_data / input_scale + input_zero_point
    test_data = test_data.astype(input_details["dtype"])

  correct_predictions = 0
  flag = True
  for i in range(len(test_data)):
    # Dump the first quantized spectrogram whose label matches the wanted class
    # (5.0 here) into a C buffer file, then continue with the normal testing loop.
    if flag and model_type == "Quantized" and test_labels[i] == 5.0:
      display(test_data[i])
      display(test_data[i][0][0])
      display(test_labels[i])
      with gfile.GFile("left_features.cc", 'w') as f:
        f.write('const int g_width = %d;\n' %
                (model_settings['fingerprint_width']))
        f.write('const int g_height = %d;\n' %
                (model_settings['spectrogram_length']))
        f.write('const unsigned char g_left_data[] = {')
        k = 0
        for value in test_data[i][0]:
          if k == 0:
            f.write('\n  ')
          f.write('%d, ' % value)
          k = (k + 1) % 10
        f.write('\n};\n')
      flag = False
    interpreter.set_tensor(input_details["index"], test_data[i])
    interpreter.invoke()
    output = interpreter.get_tensor(output_details["index"])[0]
    top_prediction = output.argmax()
    correct_predictions += (top_prediction == test_labels[i])

  print('%s model accuracy is %f%% (Number of test samples=%d)' % (
      model_type, (correct_predictions * 100) / len(test_data), len(test_data)))
In the if condition, the test_label value would be changed based on which class spectrogram we wanted (e.g., 2.0 for "go"). With this, we generated four spectrograms, one for each command. We then tested these spectrograms by adding them into our pipeline as before; however, this time we used the full pipeline, including steps 7 and 8. Step 7 essentially averages the last few model outputs to make the result as robust as possible; e.g., if the model hears "go" for a few cycles and gives that as the recognized command, it does not immediately switch to a different classification when it hears a different phrase. If it then hears "stop", it takes a few cycles of averaging the previous scores before "stop" gets a high enough score to be recognized as the command. This sort of windowing is done to account for cases where a command is split across multiple inputs; while this is not the case for us, we kept it implemented for completeness. Step 8, where we use the model's output, is discussed in the next section.
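Feeding one of these pre-loaded spectrograms to the model then amounts to copying it into the interpreter's input tensor before invoking it. A minimal sketch is shown below, assuming the buffer is the g_left_data array generated above and that the model's tensors are uint8-quantized (consistent with the 0-255 score range mentioned earlier; newer TFLite versions may use int8 instead):

```cpp
// Bypass the audio and feature-generation stages: copy a pre-generated
// spectrogram straight into the model input and read back the class scores.
// "interpreter" is the MicroInterpreter set up in main_functions.cc.
TfLiteTensor* model_input = interpreter->input(0);
for (int i = 0; i < g_width * g_height; ++i) {
  model_input->data.uint8[i] = g_left_data[i];
}

if (interpreter->Invoke() != kTfLiteOk) {
  // Inference failed; report the error over UART and bail out.
}

TfLiteTensor* model_output = interpreter->output(0);
// One score per class, in the order defined in micro_model_settings
// (silence, unknown, go, stop, right, left); >= 200 counts as a detection.
for (int i = 0; i < kCategoryCount; ++i) {
  uint8_t score = model_output->data.uint8[i];
  // ... print kCategoryLabels[i] and score over UART
}
```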
In order to make our car move according to the given commands, it was necessary to design a simple finite state machine to keep track of the car's state and invoke the model when needed. It works as follows: an integer "carState" keeps track of the current state of the car, with the values having the following meanings:
- 0 = Stop
- 1 = Go
- 2 = Left
- 3 = Right
Had we not faced the previously discussed issue with transforming the audio samples into spectrograms, the input from the microphone would be used to switch between the states, with the produced spectrograms acting as the input to the model. However, in order to provide a proof of concept, we used the four pre-generated spectrograms for the commands as input to the model and designed a system to switch between them, which in turn controls the state of the car. We chose a simple system based on the "Noisy" digital signal. Using a debouncer, we built a "word detector" that scans for input sounds from the user and counts the number of words spoken or sounds made within a period of approximately one second. The user can input 1-3 "words" or sounds, which are interpreted as follows:
- 0 sounds = No change
- 1 sound = Toggle between "Go" and "Stop" spectrogram
- 2 sounds = Present "Left" spectrogram to model
- 3 sounds = Present "Right" spectrogram to model
Based on the input spectrogram and the prediction of the model, the carState variable changes, and a call to the function that performs the corresponding movement is made. It is also worth noting that a call to the "turn left" or "turn right" function (if predicted by the model) returns the car to its previous state after the turn is completed. The previous state can only be the "Go" or "Stop" state, which allows the car to go back to moving forward or staying stopped after performing the required turn.
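A simplified sketch of this control loop is shown below; function and variable names are our own illustrations, and the word count itself comes from the debounced noise-gate signal described above:

```cpp
// Simplified control loop (illustrative names throughout): the debounced word
// count selects which pre-generated spectrogram is presented to the model, and
// the model's prediction drives the car's state machine.
extern const unsigned char g_go_data[], g_stop_data[], g_left_data[], g_right_data[];
int  RunInference(const unsigned char *spectrogram);  // wraps pipeline steps 4-7, returns winning class
void GoForward(void);
void Stop(void);
void TurnLeft(void);   // returns the car to its previous Go/Stop state afterwards
void TurnRight(void);

enum { CLASS_SILENCE, CLASS_UNKNOWN, CLASS_GO, CLASS_STOP, CLASS_RIGHT, CLASS_LEFT };
enum { STATE_STOP = 0, STATE_GO = 1, STATE_LEFT = 2, STATE_RIGHT = 3 };
static int carState = STATE_STOP;

void HandleWordCount(int wordCount)
{
    const unsigned char *spectrogram;

    switch (wordCount)
    {
    case 1:  // one sound: toggle between the "Go" and "Stop" spectrograms
        spectrogram = (carState == STATE_GO) ? g_stop_data : g_go_data;
        break;
    case 2:  // two sounds: present the "Left" spectrogram
        spectrogram = g_left_data;
        break;
    case 3:  // three sounds: present the "Right" spectrogram
        spectrogram = g_right_data;
        break;
    default:
        return;  // zero sounds: no change
    }

    switch (RunInference(spectrogram))
    {
    case CLASS_GO:    carState = STATE_GO;   GoForward(); break;
    case CLASS_STOP:  carState = STATE_STOP; Stop();      break;
    case CLASS_LEFT:  TurnLeft();  break;
    case CLASS_RIGHT: TurnRight(); break;
    default: break;  // silence / unknown: keep the current state
    }
}
```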
As for the timing characteristics when running the model (MicroSpeech), they can be summarized in the following screenshot:
As can be seen, the model's inference takes around 9,600 μs (9.6 ms), while the command recognition time varies: on reset it takes about 20,000 μs, while during normal operation it ranges from roughly 30,000 to 50,000 μs.
The video showcasing the final proof of concept can be seen below.