
Studying the effects of feature extraction settings on the accuracy and memory requirements of neural networks for keyword spotting

Shahnawaz M.;Marcon M.
2018-01-01

Abstract

Due to the always-on nature of keyword spotting (KWS) systems, low-power microcontroller units (MCUs) are the natural choice of deployment device. However, the limited computational power and memory budget of MCUs can compromise accuracy requirements. Although many studies have designed neural networks with a small memory footprint to address this problem, the effects of different feature extraction settings are rarely studied. This work addresses this question by first comparing six of the most popular state-of-the-art neural network architectures for KWS on the Google Speech Commands dataset. Then, keeping the network architectures unchanged, it performs a comprehensive investigation of the effects of different frequency-transformation settings, such as the number of mel-frequency cepstral coefficients (MFCCs) used and the length of the stride window, on the accuracy and memory footprint (RAM/ROM) of the models. The results show that different preprocessing settings can significantly change the accuracy and RAM/ROM requirements of the models. Furthermore, it is shown that the DS-CNN outperforms the other architectures in terms of accuracy, reaching 93.47% with the smallest ROM requirement, while the GRU, with an accuracy of 91.02%, outperforms all other networks at the smallest RAM requirement.
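To make the feature-extraction knobs discussed in the abstract concrete, the following is a minimal NumPy sketch of a standard MFCC pipeline (framing, Hamming window, power spectrum, triangular mel filterbank, log, DCT-II). It is not the authors' implementation; the default values (16 kHz audio, 40 ms frames, 20 ms stride, 40 mel bands, 1024-point FFT) are common KWS choices and are assumptions here. The two parameters the paper varies, the number of MFCCs kept and the stride length, map to `n_mfcc` and `stride_ms`.

```python
import numpy as np

def mfcc_features(signal, sr=16000, n_mfcc=10, frame_ms=40, stride_ms=20,
                  n_mels=40, n_fft=1024):
    """Illustrative MFCC pipeline: frame -> window -> power spectrum
    -> mel filterbank -> log -> DCT-II (keep first n_mfcc coefficients)."""
    frame_len = int(sr * frame_ms / 1000)      # samples per analysis window
    stride = int(sr * stride_ms / 1000)        # samples between window starts
    n_frames = 1 + (len(signal) - frame_len) // stride
    frames = np.stack([signal[i * stride:i * stride + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)    # taper to reduce spectral leakage
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # Triangular mel filterbank spanning 0 Hz .. sr/2.
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, ctr, hi = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, lo:ctr] = (np.arange(lo, ctr) - lo) / max(ctr - lo, 1)
        fbank[m - 1, ctr:hi] = (hi - np.arange(ctr, hi)) / max(hi - ctr, 1)

    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II basis; keeping only the first n_mfcc coefficients.
    k = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (k[:, None] + 0.5)
                 * np.arange(n_mfcc)[None, :])
    return log_mel @ dct                       # shape: (n_frames, n_mfcc)
```

With these assumed defaults, a 1 s clip at 16 kHz yields a 49 x 10 feature map; doubling the stride to 40 ms reduces it to 25 x 10. Since this feature map is the network's input, such settings directly scale the input-layer size, which is one way preprocessing choices shift the RAM/ROM footprint the paper measures.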
2018
IEEE International Conference on Consumer Electronics - Berlin, ICCE-Berlin
978-1-5386-6095-9
Files for this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1208931
Citations
  • Scopus 4