Keyword spotting (KWS) is critical to enabling voice-based user interaction on smart devices, and it requires real-time responsiveness and high accuracy to ensure a good user experience. Recently, neural networks have become a popular choice for KWS architectures because their accuracy is superior to that of traditional speech processing algorithms.
Keyword spotting neural network pipeline
KWS applications have a very limited power budget because they are "always on." Although KWS applications can also run on dedicated DSPs or high-performance CPUs, they are better suited to Arm Cortex-M microcontrollers. This also helps minimize cost, because Cortex-M microcontrollers are often already used at the edge of the Internet of Things for other tasks.
However, deploying a neural-network-based KWS on a Cortex-M-based microcontroller presents the following challenges:
1. Limited memory space. A typical Cortex-M system offers at most a few hundred kilobytes of free memory, which means the entire neural network model, including inputs/outputs, weights, and activations, must fit within this small memory budget.
2. Limited computing resources. Because KWS is always on, the real-time requirement limits the number of operations available for each neural network inference.
The following are typical neural network architectures for KWS inference:
• Deep Neural Network (DNN): A DNN is a standard feedforward neural network made up of stacked fully connected layers and non-linear activation layers.
• Convolutional Neural Network (CNN): One of the major drawbacks of DNN-based KWS is its inability to model the local temporal and frequency-domain correlations in speech features. A CNN treats the input time-domain and frequency-domain features as an image and performs 2D convolution operations on it to capture these correlations.
• Recurrent Neural Network (RNN): RNNs have shown excellent performance in many sequence modeling tasks, particularly speech recognition, language modeling, and translation. RNNs not only detect the temporal relationships in the input signal but also capture long-term dependencies using a "gating" mechanism.
• Convolutional Recurrent Neural Network (CRNN): A CRNN is a hybrid of a CNN and an RNN that exploits local temporal/spatial correlations. The CRNN model begins with a convolutional layer, followed by an RNN that encodes the signal, followed by a densely connected layer.
• Depthwise Separable Convolutional Neural Network (DS-CNN): Recently, depthwise separable convolutions have been proposed as an efficient alternative to the standard 3D convolution operation and have been used to build compact network architectures for computer vision.
A DS-CNN first convolves each channel of the input feature map with an independent 2D filter and then merges the outputs along the depth dimension using a pointwise (i.e., 1x1) convolution. By decomposing the standard 3D convolution into a 2D convolution followed by a 1D convolution, the number of parameters and operations is reduced, making deeper and wider architectures possible even on resource-constrained microcontroller devices.
Running Keyword Spotting on Cortex-M Processors
Memory usage and execution time are the two most important factors to take into account when designing and optimizing a neural network for keyword spotting. The three sets of neural network limits shown below target small, medium, and large Cortex-M systems, based on typical Cortex-M system configurations.
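Because memory footprint is one of the two deciding factors, it can help to see roughly how much the depthwise separable decomposition described above saves. The following is a minimal sketch that counts weights for a single layer, ignoring biases. The layer shape (3x3 kernel, 64 input channels, 64 output channels) is a hypothetical example chosen for illustration and is not taken from the article's models.

```python
def standard_conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a standard convolution layer (biases ignored)."""
    # Every output channel has its own k_h x k_w x c_in filter.
    return k_h * k_w * c_in * c_out


def ds_conv_params(k_h, k_w, c_in, c_out):
    """Weight count of a depthwise separable convolution (biases ignored)."""
    depthwise = k_h * k_w * c_in   # one independent 2D filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution that merges the channels
    return depthwise + pointwise


# Hypothetical layer shape, for illustration only.
std = standard_conv_params(3, 3, 64, 64)
ds = ds_conv_params(3, 3, 64, 64)
print(f"standard: {std} weights, depthwise separable: {ds} weights")
print(f"reduction factor: {std / ds:.1f}x")
```

With 8-bit weights, each weight occupies one byte, so these counts translate directly into bytes of weight memory for the layer.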
To tune the models so that they do not exceed the microcontroller's memory and compute limits, a hyperparameter search must be performed. The following table shows the neural network architectures and the corresponding hyperparameters that must be optimized.
An exhaustive search over the feature extraction and neural network model hyperparameters is performed first, followed by manual selection to narrow the search space, and the two steps are repeated iteratively. The following figure summarizes the best-performing model for each neural network architecture, along with its memory requirements and operation count. The DS-CNN architecture provides the highest accuracy while requiring significantly less memory and compute.
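The constrained search described above can be sketched as a grid enumeration that discards any configuration exceeding a memory or operation budget. This is only an illustration for a fully connected DNN, not the procedure used by the authors. The input size (250 features), the 12 output classes, the candidate grid, and the 80 KB / 6 million operation limits are all hypothetical values, and the cost model assumes 8-bit weights and activations with a simple largest-layer activation buffer.

```python
from itertools import product


def dnn_cost(n_inputs, hidden_units, n_layers, n_outputs):
    """Estimate (memory in bytes, operations per inference) for a DNN.

    Assumes 1 byte per weight, bias, and activation (8-bit quantization),
    and counts one multiply-accumulate as two operations.
    """
    sizes = [n_inputs] + [hidden_units] * n_layers + [n_outputs]
    macs = sum(a * b for a, b in zip(sizes, sizes[1:]))
    params = macs + sum(sizes[1:])  # weights plus biases
    # Activation buffer: the largest input-plus-output pair of any layer.
    activations = max(a + b for a, b in zip(sizes, sizes[1:]))
    return params + activations, 2 * macs


def feasible_configs(mem_limit, ops_limit, n_inputs=250, n_outputs=12):
    """Return the grid points that fit within both budgets."""
    grid = product([1, 2, 3, 4], [64, 128, 256, 512])  # hypothetical grid
    result = []
    for n_layers, hidden in grid:
        mem, ops = dnn_cost(n_inputs, hidden, n_layers, n_outputs)
        if mem <= mem_limit and ops <= ops_limit:
            result.append((n_layers, hidden, mem, ops))
    return result


# Hypothetical budget, for illustration only.
for n_layers, hidden, mem, ops in feasible_configs(80 * 1024, 6_000_000):
    print(f"{n_layers} layers x {hidden} units: {mem} bytes, {ops} ops")
```

In practice each surviving configuration would then be trained and ranked by accuracy, which is the step this sketch leaves out.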
The KWS application is deployed on the Cortex-M7-based STM32F746G-DISCO development board (shown below), using a DNN model with 8-bit weights and 8-bit activations, and it performs 10 inferences per second at runtime. Each inference (including memory copy, MFCC feature extraction, and DNN execution) takes approximately 12 ms. To save power, the microcontroller stays in wait-for-interrupt (WFI) mode for the rest of the time. The entire KWS application consumes approximately 70 KB of memory, including approximately 66 KB for weights, about 1 KB for activations, and about 2 KB for audio I/O and MFCC features.
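The figures above imply a low active duty cycle, which a few lines of arithmetic make explicit. This sketch uses only the approximate numbers quoted in the paragraph. The variable names are illustrative, and the result says nothing about actual power draw, which depends on the board.

```python
INFERENCES_PER_SECOND = 10   # from the deployment described above
INFERENCE_TIME_MS = 12       # approximate time per inference

active_ms_per_second = INFERENCES_PER_SECOND * INFERENCE_TIME_MS
duty_cycle = active_ms_per_second / 1000   # fraction of time spent computing
idle_fraction = 1 - duty_cycle             # fraction of time spent in WFI mode

# Approximate memory breakdown in KB, as quoted above.
memory_kb = {"weights": 66, "activations": 1, "audio_io_and_mfcc": 2}
accounted_kb = sum(memory_kb.values())

print(f"active: {active_ms_per_second} ms per second ({duty_cycle:.0%})")
print(f"idle in WFI: {idle_fraction:.0%}")
print(f"memory accounted for: about {accounted_kb} KB")
```

The listed components add up to about 69 KB, which is consistent with the quoted total of approximately 70 KB.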
All in all, Arm Cortex-M processors can achieve high accuracy in keyword spotting applications while keeping memory and compute requirements within limits by tuning the network architecture. The DS-CNN architecture provides the highest accuracy while requiring much less memory and compute.
Code, model definitions, and pre-trained models are available at github.com/ARM-software. Our new machine learning developer website provides a one-stop resource library, detailed product information, and tutorials to help address the challenges of machine learning at the network edge.
This blog is based on the white paper "Hello Edge: Keyword Spotting on Microcontrollers," originally published on the Cornell University Library website. To download a copy of the Arm white paper, click the link below: https://community.arm.com/cfs-file/__key/telligent-evolution-components-attachments/01-2142-00-00-00-01-34-26/Arm_5F00_KeywordSpotting_5F00_Whitepaper.pdf