Given the increasing popularity of Edge AI, embedded neural processing units (NPUs) are gradually becoming a standard feature in microcontrollers (MCUs) and System-on-a-Chip (SoCs). The deployment of neural networks on accelerators needs specialized neural network compilers that incorporate graph optimization stages, where layer-specific transformations are applied to reduce execution latency or memory footprint on platform-specific computing elements. For this reason, neural network compilers expose control parameters to be tuned for each individual network layer. The challenge addressed in this paper is finding an optimal combination of neural network compilation parameters for the efficient utilization of the computing resources of the target hardware accelerators. To address this task despite the huge space of parameters, we propose a greedy algorithm that iterates through the convolutional layers of the network, while preserving a set of solutions for the preceding layers. We evaluated this approach by transforming the graphs of some popular neural networks to optimize their performance and memory footprint, mapping them onto an experimental embedded NPU developed by STMicroelectronics using its associated neural network compiler. For the reported set of network models, the proposed technique has improved latency and memory footprint by 43% approximately compared to the baseline and exceeded the simulated annealing heuristics by 15% approximately.
Layer-wise Exploration of a Neural Processing Unit Compiler's Optimization Space
Fabrizio Indirli;Cristina Silvano;Vittorio Zaccaria
2024-01-01
Abstract
Given the increasing popularity of Edge AI, embedded neural processing units (NPUs) are gradually becoming a standard feature in microcontrollers (MCUs) and System-on-a-Chip (SoCs). The deployment of neural networks on accelerators needs specialized neural network compilers that incorporate graph optimization stages, where layer-specific transformations are applied to reduce execution latency or memory footprint on platform-specific computing elements. For this reason, neural network compilers expose control parameters to be tuned for each individual network layer. The challenge addressed in this paper is finding an optimal combination of neural network compilation parameters for the efficient utilization of the computing resources of the target hardware accelerators. To address this task despite the huge space of parameters, we propose a greedy algorithm that iterates through the convolutional layers of the network, while preserving a set of solutions for the preceding layers. We evaluated this approach by transforming the graphs of some popular neural networks to optimize their performance and memory footprint, mapping them onto an experimental embedded NPU developed by STMicroelectronics using its associated neural network compiler. For the reported set of network models, the proposed technique has improved latency and memory footprint by 43% approximately compared to the baseline and exceeded the simulated annealing heuristics by 15% approximately.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.