Recent advancements in ultra-low-power machine learning (TinyML) promise to unlock an entirely new class of edge applications. However, continued progress is constrained by the lack of benchmarks for Machine Learning (ML) models on TinyML hardware, which is fundamental to this field reaching maturity. In this paper, we design 3 types of fully connected Neural Networks (NNs), train each NN on 10 datasets (producing 30 NNs in total), and present a benchmark reporting the onboard performance of these models on 7 popular MCU boards (similar boards are used to design TinyML hardware). We open-source the complete benchmark results and make them freely available online, enabling TinyML researchers and developers to systematically compare, evaluate, and improve various aspects during the design phase of ML-powered IoT hardware.
B1: Teensy 4.0 (Cortex-M7 @600 MHz, 2 MB Flash, 1 MB SRAM)
B2: STM32 Nucleo H7 (Cortex-M7 @480 MHz, 2 MB Flash, 1 MB SRAM)
B3: Arduino Portenta (Cortex-M7+M4 @480 MHz, 2 MB Flash, 1 MB SRAM)
B4: Feather M4 Express (Cortex-M4 @120 MHz, 2 MB Flash, 192 KB SRAM)
B5: Generic ESP32 (Xtensa LX6 @240 MHz, 4 MB Flash, 520 KB SRAM)
B6: Arduino Nano 33 (Cortex-M4 @64 MHz, 1 MB Flash, 256 KB SRAM)
B7: Raspberry Pi Pico (Cortex-M0+ @133 MHz, 16 MB Flash, 264 KB SRAM)
D1: Iris Flowers: (4 features, 3 classes, 150 samples)
D2: Wine: (13 features, 3 classes, 178 samples)
D3: Vowel: (13 features, 11 classes, 989 samples)
D4: Statlog Vehicle Silhouettes: (18 features, 4 classes, 845 samples)
D5: Anuran Calls: (64 features, 10 classes, 1797 samples)
D6: Breast Cancer: (30 features, 2 classes, 569 samples)
D7: Texture: (40 features, 11 classes, 5000 samples)
D8: Sensorless Drive Diagnosis: (48 features, 11 classes, 999 samples)
D9: MNIST Handwritten Digits: (64 features, 10 classes, 1797 samples)
D10: Human Activity: (74 features, 6 classes, 5000 samples)
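Several of the listed shapes match the copies bundled with scikit-learn; as a quick sanity check (assuming D1 and D9 correspond to scikit-learn's `load_iris` and `load_digits` — an assumption, since the paper's exact data sources may differ):

```python
# Sanity-check two dataset shapes against scikit-learn's bundled copies.
# Assumption: D1 = sklearn load_iris, D9 = sklearn load_digits (8x8 digits);
# both match the feature/class/sample counts listed above.
from sklearn.datasets import load_iris, load_digits

X_iris, y_iris = load_iris(return_X_y=True)
print(X_iris.shape, len(set(y_iris)))   # (150, 4) with 3 classes -> D1

X_dig, y_dig = load_digits(return_X_y=True)
print(X_dig.shape, len(set(y_dig)))     # (1797, 64) with 10 classes -> D9
```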
FC 1x10: 1 layer with 10 neurons
FC 10x10: 10 layers, each containing 10 neurons
FC 10+50: 2 layers, where the 1st layer contains 10 neurons and the 2nd contains 50 neurons
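The three topologies above can be sketched as plain NumPy forward passes. The layer widths follow the descriptions given; the input/output sizes (here D1's 4 features and 3 classes), the random placeholder weights, and the ReLU activation are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

def fc_forward(x, layer_widths, n_classes, rng):
    """Forward pass through a fully connected net with random placeholder
    weights. layer_widths: hidden sizes, e.g. [10] (FC 1x10), [10] * 10
    (FC 10x10), or [10, 50] (FC 10+50)."""
    sizes = [x.shape[0]] + list(layer_widths) + [n_classes]
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        W = rng.standard_normal((n_out, n_in))
        b = rng.standard_normal(n_out)
        x = np.maximum(W @ x + b, 0.0)  # ReLU (assumed activation)
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # one D1-sized input (4 features)
for name, widths in [("FC 1x10", [10]),
                     ("FC 10x10", [10] * 10),
                     ("FC 10+50", [10, 50])]:
    out = fc_forward(x, widths, n_classes=3, rng=rng)
    print(name, out.shape)              # 3-class output vector per model
```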
The figure below (y-axis in base-10 log scale) presents the average time taken by MCU boards B1 - B7 to perform inference on datasets D1 - D10.
- For all 3 NN types, the Teensy 4.0 (B1) is the fastest, performing unit inference in 3.14 µs, 11.13 µs, and 18.12 µs respectively.
- For the same data samples, the Raspberry Pi Pico (B7) is the slowest (≈ 99 - 175 x slower than B1), taking 313.77 µs, 1953.96 µs, and 2801.82 µs.
- Although B7 has a faster clock than the Arduino Nano 33 (B6), it is still slower, since the Cortex-M4 core outperforms the Cortex-M0+.
- Although B1 - B4 all use a Cortex-M7 processor, B1 is still significantly faster because it has the highest clock speed (600 MHz).
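The ≈ 99 - 175 x range quoted above can be reproduced directly from the reported per-model timings (all values copied from the bullets; nothing new is assumed):

```python
# Recompute B7's slowdown relative to B1 from the reported unit-inference
# times (µs), in FC 1x10, FC 10x10, FC 10+50 order.
b1 = [3.14, 11.13, 18.12]          # Teensy 4.0
b7 = [313.77, 1953.96, 2801.82]    # Raspberry Pi Pico

slowdown = [slow / fast for slow, fast in zip(b7, b1)]
print([round(s, 1) for s in slowdown])  # -> [99.9, 175.6, 154.6]
```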
The figure below (y-axis in base-10 log scale) presents the complete inference time on the STM32 Nucleo H7 (B2) for each of the 30 models.
- For the FC 1x10 network, inference took 5.16 µs on the 4-feature Iris dataset (D1) and 872.85 µs on the 74-feature Human Activity dataset (D10).
- For FC 10x10, inference took 20.15 µs on the Iris dataset and 3369.54 µs on the Human Activity dataset.
The figure below presents the time taken by the Arduino IDE to compile each of the 30 models for the STM32 Nucleo H7 (B2), along with the complete Flash and SRAM requirements. Models trained on datasets with more features and classes required longer compilation times and more flash memory.
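As a rough intuition for why more features and classes inflate flash usage, the dense weight storage alone grows with both. A back-of-the-envelope parameter count (assuming float32 weights and biases; real footprints also include the inference runtime and application code, which this sketch ignores):

```python
def fc_params(n_features, hidden, n_classes):
    """Count weights + biases of a fully connected network."""
    sizes = [n_features] + hidden + [n_classes]
    return sum(i * o + o for i, o in zip(sizes[:-1], sizes[1:]))

# FC 10+50 on D1 (4 features, 3 classes) vs. D10 (74 features, 6 classes)
small = fc_params(4, [10, 50], 3)     # 4*10+10 + 10*50+50 + 50*3+3 = 753
large = fc_params(74, [10, 50], 6)    # 74*10+10 + 10*50+50 + 50*6+6 = 1606
print(small * 4, "bytes vs", large * 4, "bytes at float32")
```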
If you find our TinyML benchmark helpful for your work, please cite this paper using the BibTeX entry below.
@inproceedings{BharathTinyML,
author = {Bharath Sudharsan and Simone Salerno and Duc-Duy Nguyen and Muhammad Yahya and Abdul Wahid and Piyush Yadav and John G. Breslin and Muhammad Intizar Ali},
title = {TinyML Benchmark: Executing Fully Connected Neural Networks on Commodity Microcontrollers},
booktitle = {IEEE 7th World Forum on Internet of Things},
year = {2021}
}
For any clarification or further information, please don't hesitate to contact me. Email: bharathsudharsan023@gmail.com