A new invention from one of MIT's labs, a low-power chip specialised for automatic speech recognition, could bring voice control to the Internet of Things. The chip delivers power savings of 90 to 99 percent compared with existing solutions, which could make voice control practical for relatively simple electronic devices such as IoT sensors.
A modern smartphone running a speech-recognition engine might require about 1 watt of power; the new chip requires between 0.2 and 10 milliwatts, depending on the number of words it has to recognise. In IoT deployments that amounts to a power savings of roughly 90 to 99 percent, which matters most for power-constrained devices that have to harvest energy from their environments or go months between battery charges.
"Speech input will become a natural interface for many wearable applications and intelligent devices," says Anantha Chandrakasan, the Vannevar Bush Professor of Electrical Engineering and Computer Science at MIT, whose group developed the new chip. "The miniaturization of these devices will require a different interface than touch or keyboard. It will be critical to embed the speech functionality locally to save system energy consumption compared to performing this operation in the cloud."
Today, the best-performing speech recognisers are, like many other state-of-the-art artificial-intelligence systems, based on neural networks, virtual networks of simple information processors roughly modeled on the human brain. Much of the new chip’s circuitry is concerned with implementing speech-recognition networks as efficiently as possible.
But even the most power-efficient speech recognition system would quickly drain a device’s battery if it ran without interruption. So the chip also includes a simpler “voice activity detection” circuit that monitors ambient noise to determine whether it might be speech. If the answer is yes, the chip fires up the larger, more complex speech-recognition circuit.
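The voice-activity-detection idea can be illustrated with a minimal energy-threshold detector. This is a generic sketch of the technique, not the MIT circuit; the threshold value and frame contents below are illustrative.

```python
# Minimal energy-threshold voice activity detector (VAD) sketch.
# A real chip would implement this in low-power analog/digital logic;
# the 0.01 threshold here is an arbitrary illustrative value.

def frame_energy(samples):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / len(samples)

def is_speech(frame, threshold=0.01):
    """Flag a frame as possible speech when its energy exceeds the threshold,
    which would wake the larger speech-recognition circuit."""
    return frame_energy(frame) > threshold

# A loud frame trips the detector; near-silence does not.
loud = [0.5, -0.4, 0.6, -0.5]
quiet = [0.001, -0.002, 0.001, 0.0]
```

In this scheme only the cheap energy computation runs continuously, and the expensive recogniser is powered up only when `is_speech` fires.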
A typical neural network consists of thousands of processing “nodes” capable of only simple computations but densely connected to each other. In the type of network commonly used for voice recognition, the nodes are arranged into layers. Voice data are fed into the bottom layer of the network, whose nodes process and pass them to the nodes of the next layer, whose nodes process and pass them to the next layer, and so on. The output of the top layer indicates the probability that the voice data represents a particular speech sound.
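The layer-by-layer flow described above can be sketched as a toy forward pass. The weights and input values here are made up for illustration; they are not the network on the MIT chip.

```python
import math

# Toy sketch of the layered network described above: each node computes a
# weighted sum of the previous layer's outputs through a nonlinearity, and
# the top layer is normalised into probabilities over speech sounds.

def layer(inputs, weights):
    """One layer: every node combines all of the previous layer's outputs."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)))
            for row in weights]

def softmax(xs):
    """Turn top-layer outputs into probabilities that sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

voice_frame = [0.2, -0.1, 0.4]   # bottom-layer input, e.g. audio features
hidden = layer(voice_frame, [[0.5, -0.3, 0.8],
                             [0.1, 0.9, -0.2]])
# Three top-layer nodes, one per hypothetical speech sound.
probs = softmax(layer(hidden, [[1.0, -1.0],
                               [0.3, 0.7],
                               [-0.5, 0.2]]))
```

Each entry of `probs` plays the role described in the text: the probability that the input frame represents one particular speech sound.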
A voice-recognition network is too big to fit in a chip’s onboard memory, which is a problem because going off-chip for data is much more energy intensive than retrieving it from local stores. So the MIT researchers’ design concentrates on minimizing the amount of data that the chip has to retrieve from off-chip memory.
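The payoff of minimizing off-chip traffic can be made concrete with a back-of-the-envelope count of weight fetches. One common way to cut such traffic is to reuse each fetched block of weights across several buffered input frames; this is a generic technique offered for illustration, not necessarily the specific method the MIT researchers used, and the block counts below are invented.

```python
# Illustrative count of off-chip weight fetches for a network whose
# weights do not fit in on-chip memory. Buffering frames and reusing
# each fetched weight block across all of them cuts traffic proportionally.

def fetches(num_weight_blocks, num_frames, frames_per_fetch):
    """Weight-block fetches needed to process num_frames of input."""
    batches = -(-num_frames // frames_per_fetch)  # ceiling division
    return num_weight_blocks * batches

naive = fetches(1000, 32, 1)     # refetch every weight block for each frame
batched = fetches(1000, 32, 8)   # reuse each block across 8 buffered frames
```

Since each off-chip access costs far more energy than an on-chip one, an 8x reduction in fetches translates almost directly into energy saved on memory traffic.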