Neural networks are becoming increasingly popular in "always-on" IoT edge devices that require local data analysis, primarily because processing data on the device can significantly reduce the latency and power consumption associated with data transmission. When considering neural networks on IoT edge devices, the Arm Cortex-M series of processor cores naturally comes to mind, and if you are looking to boost performance while minimizing memory footprint, CMSIS-NN is the ideal choice. Neural network inference based on CMSIS-NN kernels can achieve a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency over a baseline implementation.
The CMSIS-NN library consists of two main components: NNFunctions and NNSupportFunctions. NNFunctions contains the kernels that implement common neural network layer types, such as convolution, depthwise separable convolution, fully connected (inner-product), pooling, and activation; application code calls these functions to perform neural network inference. The kernel APIs are deliberately kept simple, so they can easily be retargeted from any machine learning framework. NNSupportFunctions contains utility functions such as data conversion and activation function tables, which can also be used to build more complex NN modules such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) cells.
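As an illustration, here is a minimal sketch of how application code might call these kernels, assuming the legacy fixed-point q7 API shipped with CMSIS 5 (arm_fully_connected_q7 plus an in-place ReLU). The dimensions, weight values, and shift amounts below are placeholders, not values from the network discussed later:

```c
#include "arm_nnfunctions.h"

#define DIM_VEC  128  /* input vector length (hypothetical) */
#define NUM_ROWS  64  /* number of output neurons (hypothetical) */

/* Quantized weights and biases in q7_t (8-bit fixed-point) format;
   real values would come from an offline quantization step. */
static const q7_t weights[DIM_VEC * NUM_ROWS] = { 0 };
static const q7_t bias[NUM_ROWS] = { 0 };

static q7_t  input[DIM_VEC];
static q7_t  output[NUM_ROWS];
static q15_t vec_buffer[DIM_VEC];  /* scratch buffer required by the kernel */

void run_fc_layer(void)
{
    /* NNFunctions kernel: fully connected (inner-product) layer.
       bias_shift/out_shift control the fixed-point scaling and are
       network-dependent placeholders here. */
    arm_fully_connected_q7(input, weights, DIM_VEC, NUM_ROWS,
                           0 /* bias_shift */, 7 /* out_shift */,
                           bias, output, vec_buffer);

    /* Activation kernel: in-place ReLU on the layer output. */
    arm_relu_q7(output, NUM_ROWS);
}
```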
For some kernels, such as fully connected layers and convolutions, multiple versions of the kernel functions are provided. Arm supplies a basic version that works "as-is" for any layer parameters, along with further optimized versions that may involve input transformations or impose specific constraints on the layer parameters. Ideally, a simple script can analyze the network topology and automatically select the most appropriate function for each layer, as the dispatcher sketch below illustrates.
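For example, alongside arm_convolve_HWC_q7_basic, the legacy q7 API provides arm_convolve_HWC_q7_fast, which requires the input channel count to be a multiple of 4 and the output channel count to be a multiple of 2. A runtime dispatcher that mimics what such a script would decide offline might look like the following sketch (again assuming the legacy CMSIS 5 q7 HWC API):

```c
#include "arm_nnfunctions.h"

/* Pick the convolution kernel variant based on the layer parameters.
   The _fast variant constrains them (input channels divisible by 4,
   output channels divisible by 2); the basic variant accepts any. */
arm_status convolve_auto(const q7_t *im_in, uint16_t dim_im_in, uint16_t ch_im_in,
                         const q7_t *wt, uint16_t ch_im_out, uint16_t dim_kernel,
                         uint16_t padding, uint16_t stride,
                         const q7_t *bias, uint16_t bias_shift, uint16_t out_shift,
                         q7_t *im_out, uint16_t dim_im_out,
                         q15_t *buffer_a, q7_t *buffer_b)
{
    if ((ch_im_in % 4 == 0) && (ch_im_out % 2 == 0)) {
        return arm_convolve_HWC_q7_fast(im_in, dim_im_in, ch_im_in, wt,
                                        ch_im_out, dim_kernel, padding, stride,
                                        bias, bias_shift, out_shift,
                                        im_out, dim_im_out, buffer_a, buffer_b);
    }
    return arm_convolve_HWC_q7_basic(im_in, dim_im_in, ch_im_in, wt,
                                     ch_im_out, dim_kernel, padding, stride,
                                     bias, bias_shift, out_shift,
                                     im_out, dim_im_out, buffer_a, buffer_b);
}
```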
We tested the CMSIS-NN kernels on a convolutional neural network (CNN) trained on the CIFAR-10 dataset, which consists of 60,000 32x32 color images in 10 output classes. The network topology is based on the built-in example in Caffe, with three convolutional layers and one fully connected layer. The table below lists the layer parameters and the detailed per-layer runtime results measured with the CMSIS-NN kernels on an Arm Cortex-M7 core running at 216 MHz on the STMicroelectronics NUCLEO-F746ZG mbed development board.
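As a companion to the numbers, the sketch below shows what the first conv-pool-ReLU stage of this network looks like with the legacy q7 kernels. The layer parameters (5x5 kernels, 32 filters, pad 2, stride 1, 3x3 max-pooling with stride 2) follow the Caffe example; the fixed-point shift values are placeholders, and Arm's actual example code uses a further-optimized RGB-specific convolution variant for this first layer:

```c
#include "arm_nnfunctions.h"

#define CONV1_BIAS_SHIFT 0  /* placeholder fixed-point shifts */
#define CONV1_OUT_SHIFT  9

static q7_t  img_buf1[32 * 32 * 32];   /* conv1 output: 32x32x32 */
static q7_t  img_buf2[16 * 16 * 32];   /* pool1 output: 16x16x32 */
static q15_t col_buf[2 * 5 * 5 * 3];   /* local im2col scratch   */

void cifar10_stage1(const q7_t *image,  /* 32x32x3 HWC input */
                    const q7_t *conv1_wt, const q7_t *conv1_bias)
{
    /* conv1: 32x32x3 -> 32x32x32 (5x5 kernel, pad 2, stride 1) */
    arm_convolve_HWC_q7_basic(image, 32, 3, conv1_wt, 32, 5, 2, 1,
                              conv1_bias, CONV1_BIAS_SHIFT, CONV1_OUT_SHIFT,
                              img_buf1, 32, col_buf, NULL);

    /* pool1: 3x3 max-pooling, stride 2 -> 16x16x32 */
    arm_maxpool_q7_HWC(img_buf1, 32, 32, 3, 0, 2, 16, NULL, img_buf2);

    /* relu1, applied in place */
    arm_relu_q7(img_buf2, 16 * 16 * 32);

    /* The conv2/conv3 stages and the fully connected layer follow the
       same pattern, ending with a softmax over the 10 classes. */
}
```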
Classifying one image takes approximately 99.1 milliseconds, equivalent to about 10.1 images per second, and the CPU's compute throughput is around 249 MOps per second. The pre-quantized network achieves 80.3% accuracy on the CIFAR-10 test set, while the 8-bit quantized network running on the Arm Cortex-M7 reaches 79.9%. The maximum memory footprint with the CMSIS-NN kernels is approximately 133 KB, thanks to the use of local im2col for the convolutions. Without local im2col, the footprint would be around 332 KB, and the neural network would not fit on the board at all.
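To make the memory argument concrete, here is a plain-C illustration of the local (partial) im2col idea, reduced to a single output channel: only one column of the im2col matrix is materialized at a time, so peak scratch memory is K*K*CH values instead of the H*W*K*K*CH needed to expand the whole image up front. The structure is illustrative rather than CMSIS-NN internals; the actual kernels expand a couple of columns at a time and consume them with SIMD multiply-accumulate instructions:

```c
#include <stdint.h>
#include <string.h>

#define K    5   /* kernel size    */
#define CH   3   /* input channels */
#define DIM 32   /* input is DIM x DIM x CH (HWC), stride 1, no padding */

static int8_t col_buf[K * K * CH];  /* one im2col column, reused per pixel */

static int32_t dot(const int8_t *a, const int8_t *b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += (int32_t)a[i] * b[i];
    }
    return acc;
}

/* Single-output-channel convolution using local im2col. */
void conv_local_im2col(const int8_t *in, const int8_t *wt, int32_t *out)
{
    const int out_dim = DIM - K + 1;
    for (int y = 0; y < out_dim; y++) {
        for (int x = 0; x < out_dim; x++) {
            /* Gather only this pixel's KxKxCH receptive field. */
            for (int ky = 0; ky < K; ky++) {
                memcpy(&col_buf[ky * K * CH],
                       &in[((y + ky) * DIM + x) * CH],
                       (size_t)(K * CH));
            }
            /* Consume it immediately; the buffer is reused next pixel. */
            out[y * out_dim + x] = dot(wt, col_buf, K * K * CH);
        }
    }
}
```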
To quantify the benefit of the CMSIS-NN kernels over existing solutions, we also implemented a baseline version using a one-dimensional convolution function (arm_conv from CMSIS-DSP) together with Caffe-like pooling and ReLU operations. The table below summarizes the comparison between the baseline functions and the CMSIS-NN kernels for this CNN application: the CMSIS-NN kernels deliver 2.6 to 5.4 times the runtime/throughput of the baseline functions, and the energy-efficiency improvement tracks the throughput gain closely.
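For per-layer runtime measurements of this kind, a common approach on Cortex-M7 is to read the DWT cycle counter around each kernel call. A minimal sketch follows; the device header and the kernel wrapper are placeholders, and some Cortex-M7 devices additionally require unlocking the DWT through its lock access register first:

```c
#include <stdint.h>
#include "stm32f7xx.h"  /* CMSIS device header for the NUCLEO-F746ZG part */

/* Count core cycles spent in one kernel invocation; at 216 MHz,
   runtime in seconds is cycles / 216e6. */
uint32_t measure_cycles(void (*kernel_under_test)(void))
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable DWT unit */
    DWT->CYCCNT = 0;                                 /* reset counter   */
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start counting  */

    kernel_under_test();  /* e.g. a CMSIS-NN layer or the arm_conv baseline */

    return DWT->CYCCNT;
}
```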
An efficient neural network kernel library is essential to extracting the full performance of Arm Cortex-M CPUs. CMSIS-NN provides optimized functions that accelerate key neural network layers, including convolution, pooling, and activation. Just as importantly, it reduces the memory footprint, which is critical on microcontrollers with limited memory resources.