Layer-wise Exploration of a Neural Processing Unit Compiler's Optimization Space

Fabrizio Indirli, Cristina Silvano, Vittorio Zaccaria
2024-01-01

Abstract

Given the increasing popularity of Edge AI, embedded neural processing units (NPUs) are gradually becoming a standard feature in microcontrollers (MCUs) and systems-on-chip (SoCs). Deploying neural networks on these accelerators requires specialized neural network compilers that incorporate graph optimization stages, where layer-specific transformations are applied to reduce execution latency or memory footprint on platform-specific computing elements. For this reason, neural network compilers expose control parameters that can be tuned for each individual network layer. The challenge addressed in this paper is finding an optimal combination of neural network compilation parameters for the efficient utilization of the computing resources of the target hardware accelerator. To tackle this task despite the huge parameter space, we propose a greedy algorithm that iterates through the convolutional layers of the network while preserving a set of solutions for the preceding layers. We evaluated this approach by transforming the graphs of several popular neural networks to optimize their performance and memory footprint, mapping them onto an experimental embedded NPU developed by STMicroelectronics using its associated neural network compiler. For the reported set of network models, the proposed technique improved latency and memory footprint by approximately 43% compared to the baseline and outperformed a simulated annealing heuristic by approximately 15%.
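
The full paper is not included in this record, so the algorithm is only summarized above. The following is a minimal Python sketch of a layer-wise greedy search of the kind described: it extends a small set of retained partial configurations with every parameter choice for the current layer, evaluates each candidate, and prunes the set back to its Pareto-optimal subset. The `evaluate` callback, the `param_space` function, the retained-set size, and the toy cost model in the usage example are all illustrative assumptions, not taken from the paper; a real flow would invoke the NPU compiler to obtain latency and memory figures.

    def pareto_front(candidates):
        """Keep only (config, (latency, memory)) pairs that are not
        dominated in both objectives by another candidate."""
        front = []
        for cfg, (lat, mem) in candidates:
            dominated = any(
                l <= lat and m <= mem and (l, m) != (lat, mem)
                for _, (l, m) in candidates
            )
            if not dominated:
                front.append((cfg, (lat, mem)))
        return front


    def layerwise_greedy(layers, param_space, evaluate, keep=8):
        """Greedy layer-wise exploration: for each convolutional layer,
        extend every retained partial configuration with every parameter
        choice for that layer, evaluate the results, then prune back to
        at most `keep` non-dominated partial solutions."""
        solutions = [({}, None)]  # start from the empty configuration
        for layer in layers:
            candidates = []
            for cfg, _ in solutions:
                for params in param_space(layer):
                    new_cfg = {**cfg, layer: params}
                    candidates.append((new_cfg, evaluate(new_cfg)))
            solutions = sorted(pareto_front(candidates),
                               key=lambda c: c[1])[:keep]
        return solutions


    if __name__ == "__main__":
        # Toy usage example with a stand-in cost model (illustrative only).
        layers = ["conv1", "conv2", "conv3"]
        param_space = lambda layer: [{"tile": t, "split": s}
                                     for t in (8, 16, 32) for s in (1, 2)]

        def evaluate(cfg):
            lat = sum(100 / p["tile"] + 5 * p["split"] for p in cfg.values())
            mem = sum(p["tile"] * 2 / p["split"] for p in cfg.values())
            return lat, mem

        for cfg, (lat, mem) in layerwise_greedy(layers, param_space, evaluate):
            print(cfg, f"latency={lat:.1f}", f"memory={mem:.1f}")

The key design point suggested by the abstract is that pruning happens per layer, so the search cost grows roughly linearly with network depth instead of exponentially with the full cross-product of per-layer parameters.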
2024
ICCTA '24: Proceedings of the 2024 10th International Conference on Computer Technology Applications

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1271888