Bolt is a lightweight library for mobile devices. As a universal deployment tool for all kinds of neural networks, Bolt aims to minimize inference runtime as much as possible. Higher speed, better security and more efficient memory management are the advantages that Bolt strives to provide. Feel free to open an issue, or join our QQ chat group (Chinese): 833345709.
-
Bolt supports almost all ARM-A devices, including ARMv7, ARMv8, ARMv8.2 and Mali GPU. FP16 and BNN inference on the CPU and FP16 inference on the GPU are highly optimized. Bolt also supports FP32 on ARMv7/ARMv8/ARMv8.2 devices.
Bolt has its own model storage format, which helps reduce the memory footprint by storing weights in FP16, INT8 and 1-bit representations when possible. We provide model converters for the following formats:
- caffe
- onnx
- tflite
For PyTorch and TensorFlow models, please try converting them to the onnx or tflite format first. We have also had some success converting such models into customized caffe models.
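As a rough illustration of how the storage precision affects the footprint, the snippet below computes the raw weight storage of a hypothetical 10-million-parameter model at each precision Bolt's format can use (the parameter count is an assumption made for the example):

```cpp
// Back-of-the-envelope weight storage for a hypothetical 10M-parameter model,
// illustrating why FP16 / INT8 / 1-bit storage reduces the footprint.
#include <cstdio>

int main() {
    const double params = 10e6;
    std::printf("FP32 : %.1f MB\n", params * 4 / 1e6);  // 40.0 MB
    std::printf("FP16 : %.1f MB\n", params * 2 / 1e6);  // 20.0 MB
    std::printf("INT8 : %.1f MB\n", params * 1 / 1e6);  // 10.0 MB
    std::printf("1-bit: %.1f MB\n", params / 8 / 1e6);  //  1.3 MB
    return 0;
}
```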
-
Bolt has shown high performance in the inference of common CV and NLP neural networks. Some representative networks that we have verified are listed below. You can find detailed benchmark information in docs/BENCHMARK.md.
- Squeezenet
- Mobilenet v1, v2, v3
- Resnet50, Ghostnet (plus FPN detection)
- Birealnet18 (BNN)
- SSD (Resnet)
- Bert, TinyBert, Albert
- Neural Machine Translation
- Automatic Speech Recognition
- Text To Speech
For Mali GPU FP16 support:
- Squeezenet v1.1
- Mobilenet v1, v2, v3
- Ghostnet
-
Apart from the refined acceleration of convolutions and GEMM for the supported data precisions, Bolt has an easy-to-use and powerful inference graph optimizer. As shown in model-tools/include, classic operator fusion is supported. Bolt is also equipped with a Memory Reuse Optimizer, which reassigns the space occupied by a feature map as soon as it is no longer needed as input or output. Most networks that we tested benefit from a two-thirds reduction in feature map storage.
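The reuse idea can be pictured as a simple liveness pass over the graph. The code below is a minimal sketch, not Bolt's actual implementation: the Op structure, the tensor-name keys and the greedy first-fit policy are all assumptions made for the illustration.

```cpp
// Minimal sketch of liveness-based memory reuse (illustrative only):
// each tensor releases its buffer right after its last use, so later
// tensors of a compatible size can occupy the same storage.
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct Op {
    std::vector<std::string> inputs;
    std::vector<std::string> outputs;
};

std::map<std::string, int> AssignBuffers(const std::vector<Op>& graph,
                                         const std::map<std::string, size_t>& tensorSize) {
    // 1. Record the index of the last operator that reads each tensor.
    std::map<std::string, size_t> lastUse;
    for (size_t i = 0; i < graph.size(); ++i) {
        for (const auto& t : graph[i].inputs) lastUse[t] = i;
    }

    std::map<std::string, int> assignment;  // tensor name -> buffer id
    std::vector<size_t> bufferBytes;        // capacity of each buffer
    std::vector<int> freeBuffers;           // ids currently reusable

    for (size_t i = 0; i < graph.size(); ++i) {
        // 2. Give every output a buffer, preferring a free one that is large enough.
        for (const auto& t : graph[i].outputs) {
            int chosen = -1;
            for (size_t k = 0; k < freeBuffers.size(); ++k) {
                if (bufferBytes[freeBuffers[k]] >= tensorSize.at(t)) {
                    chosen = freeBuffers[k];
                    freeBuffers.erase(freeBuffers.begin() + k);
                    break;
                }
            }
            if (chosen < 0) {
                chosen = static_cast<int>(bufferBytes.size());
                bufferBytes.push_back(tensorSize.at(t));
            }
            assignment[t] = chosen;
        }
        // 3. Inputs that are read for the last time release their buffers.
        for (const auto& t : graph[i].inputs) {
            if (lastUse[t] == i && assignment.count(t)) {
                freeBuffers.push_back(assignment[t]);
            }
        }
    }
    return assignment;
}
```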
-
Users can specify the preferred policy (high-performance or low-power). Bolt will select the most suitable core and set the thread affinity.
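As a rough sketch of what setting thread affinity looks like on Linux/Android (not Bolt's actual core-selection code), the helper below pins the calling thread to one core. Which core id corresponds to a big or little core is device-specific and assumed to be known here; a high-performance policy would pick a big-core id, a low-power policy a little-core id.

```cpp
// Illustrative sketch only: pin the calling thread to a chosen CPU core.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for cpu_set_t / sched_setaffinity on glibc
#endif
#include <sched.h>

int bind_to_core(int coreId) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(coreId, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);  // pid 0 = calling thread
}
```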
-
Bolt can tailor-make the algorithm configuration for your specific target device.
Bolt provides install.sh for fast installation. The major third-party dependency is protobuf, and others may come from the original model format that you want to use. You may also need libjpeg for building tests/classification.
After configuring bolt.cmake, the compilation can be as simple as:
./install.sh -t 48 -c llvm
For more details, please refer to docs/INSTALL.md.
As a user, what you normally need to care about includes the following four parts:
- API (We guarantee that the C API will not be changed in the future)
- Model Preparation
- Model Conversion
- Model Inference
For the details, please refer to docs/USER_HANDBOOK.md
-
We welcome all kinds of contributions. Before contributing, let's get familiar with the project structure.
-
- uni hosts the common headers that are used in the project.
- gcl hosts the setup of MALI GPU environment.
- image hosts common preprocessing routines for image inputs (e.g. bilinear interpolation).
- blas-enhance hosts the fast implementations of matrix-matrix and matrix-vector multiplication for FP32, FP16 and INT8. It is referenced by some of the operators in tensor_computing.
- tensor_computing hosts the implementation for individual operators.
- model-tools hosts everything related to model conversion and optimization.
- inference hosts the inference engine of neural networks.
- Lastly, tests hosts all the unit tests for the functionalities above.
To support your own network, first try to convert it with the provided tools. If an operator is missing, add the conversion logic to model-tools, then implement the missing computation routine in tensor_computing. Please also define a class for your new operator in inference, as sketched below.
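For illustration only, the toy skeleton below mimics the general shape of such an operator class. The ToyOperator base class and its method names are made up for this example; the real interface is defined under inference and differs in detail.

```cpp
// Toy skeleton only: shows the shape of "define a class for your new operator".
// The base class and method names here are hypothetical, not Bolt's real API.
#include <vector>

struct ToyTensor {
    std::vector<float> data;
};

class ToyOperator {  // stand-in for the real operator base class in inference
public:
    virtual ~ToyOperator() = default;
    virtual void run(const ToyTensor& in, ToyTensor& out) = 0;
};

class MyNewOp : public ToyOperator {  // your new operator
public:
    void run(const ToyTensor& in, ToyTensor& out) override {
        // Forward to the computation routine you added in tensor_computing.
        out.data = in.data;
    }
};
```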
All contributions are welcome. For details, please refer to docs/DEVELOPER.md.
We provide a detailed benchmark report for your reference. For more testing information, please refer to docs/BENCHMARK.md.
Future Release 2020-09-01
- Yolo support
- TensorFlow model converter
Who is using Bolt:
- HUAWEI CBG
- HUAWEI PORTFOLIO
-
Why doesn't configuring bolt.cmake take effect?
The install.sh serves as an example of compilation setup, and it overwrites some settings in bolt.cmake. Please check install.sh first.
-
More details about dependency libraries for cross-compilation?
The major dependency is protobuf. protoc should be the x86 version, but the protobuf library should be the ARM version.
-
Requirements on tensor dimensions?
For optimal performance, Bolt requires the number of output channels to be divisible by 8. For compatibility, Bolt will try to pad the output channels of convolution layers to the nearest multiple of 8. You can turn on USE_DEBUG in bolt.cmake to check the actual dimensions.
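For instance, this padding amounts to rounding the channel count up to the next multiple of 8. A minimal helper (hypothetical, not a Bolt API) looks like this:

```cpp
// Round output channels up to the nearest multiple of 8:
// 30 channels would be padded to 32, 64 stays at 64.
inline int padTo8(int channels) {
    return (channels + 7) / 8 * 8;
}
```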
-
Restrictions for BNN?
For BNN convolution layers, the number of output channels must be divisible by 32.
-
Restrictions on quantization (int8)?
For the time being, Bolt only supports post-training int8 quantization. The quantization method is symmetric for both activations and weights. We have added a calibration tool for image CNN pipelines. Please feel free to report cases of usage failures.
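As a generic sketch of symmetric post-training quantization (illustrative only, not the actual calibration tool), a single per-tensor scale is derived from the maximum absolute value and applied without a zero point:

```cpp
// Generic symmetric int8 quantization sketch: one scale per tensor,
// derived from the maximum absolute value, no zero point.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

float computeScale(const std::vector<float>& tensor) {
    float maxAbs = 0.0f;
    for (float v : tensor) maxAbs = std::max(maxAbs, std::fabs(v));
    return maxAbs > 0.0f ? maxAbs / 127.0f : 1.0f;
}

int8_t quantize(float x, float scale) {
    float q = std::round(x / scale);
    q = std::max(-127.0f, std::min(127.0f, q));  // clamp to int8 range
    return static_cast<int8_t>(q);
}
```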
-
Requirements for fp16 and int8?
Only ARMv8.2 supports the fp16 arithmetic and int8 dot-product instructions.
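If you are unsure whether your toolchain target enables these instructions, the standard ACLE feature macros can be probed at compile time. The snippet below is a generic check, independent of Bolt:

```cpp
// Compile-time probe using standard ACLE feature macros: they are defined
// by the compiler only when the target (e.g. armv8.2-a with fp16/dotprod
// extensions) enables the corresponding instructions.
#include <cstdio>

int main() {
#if defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)
    std::puts("FP16 vector arithmetic available");
#endif
#if defined(__ARM_FEATURE_DOTPROD)
    std::puts("int8 dot-product (SDOT/UDOT) available");
#endif
    return 0;
}
```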
-
Restrictions for MALI?
Mali GPU computing is only supported when compiling with llvm.
Bolt refers to the following projects: caffe, onnx, protobuf, flatbuffers, ncnn, mnn, dabnn.
The MIT License (MIT)