RE.PUBLIC@POLIMI Research publications of Politecnico di Milano

A tile-based fused-layer approach to accelerate DCNNs on low-density FPGAs

Erdem A.;Babic D.;Silvano C.

2019-01-01

Abstract

Accelerating Deep Convolutional Neural Networks (DCNNs) on FPGAs is attracting considerable interest across a wide range of applications, such as image recognition and classification. Although memory requirements still represent a challenge for fast and efficient inference workloads, fused-layer accelerators have recently been proposed to mitigate memory bandwidth problems. In this scenario, we propose a tile-based fused-layer approach that exploits temporal and spatial locality when mapping a DCNN to a specialized low-density FPGA. In our tile-based approach, we propose a fusing depth of two convolutional layers, so that the design fits into a low-density FPGA without affecting performance while reducing the memory bandwidth. We demonstrate the effectiveness of our two-fused-layer approach by accelerating the VGG16 network on a Zynq-7020. We achieve an average speedup of 1.44x while saving up to 61% of memory transactions with respect to a baseline represented by a tiled-only (not fused) version. A tile-based fully fused version of VGG16 could not have been mapped onto a low-density Zynq-7020.
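
The idea of fusing two convolutional layers per tile can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a single channel, 3x3 kernels, no padding, and plain Python lists instead of on-chip buffers, and every function name (`conv3x3`, `layer_by_layer`, `fused_tiled`) is hypothetical. It only shows that computing both layers tile by tile, with the intermediate tile kept local, yields the same result as the layer-by-layer baseline.

```python
# Illustrative sketch only: single channel, 3x3 kernels, "valid" convolution.
# All names are hypothetical and not taken from the paper.

def conv3x3(x, k):
    """Valid 3x3 convolution (cross-correlation) of a 2-D list x with kernel k."""
    h, w = len(x), len(x[0])
    return [
        [
            sum(x[i + di][j + dj] * k[di][dj] for di in range(3) for dj in range(3))
            for j in range(w - 2)
        ]
        for i in range(h - 2)
    ]


def layer_by_layer(x, k1, k2):
    """Baseline: the full intermediate feature map is produced, then consumed."""
    return conv3x3(conv3x3(x, k1), k2)


def fused_tiled(x, k1, k2, tile):
    """Fuse two conv layers per output tile; the intermediate tile stays local."""
    out_h, out_w = len(x) - 4, len(x[0]) - 4
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(0, out_h, tile):
        for c in range(0, out_w, tile):
            th, tw = min(tile, out_h - r), min(tile, out_w - c)
            # An output tile of th x tw needs a (th+4) x (tw+4) input patch.
            patch = [row[c:c + tw + 4] for row in x[r:r + th + 4]]
            mid = conv3x3(patch, k1)  # (th+2) x (tw+2), never written back
            res = conv3x3(mid, k2)    # th x tw
            for i in range(th):
                out[r + i][c:c + tw] = res[i]
    return out


# Small integer example so that results can be compared exactly.
x = [[(3 * i + 5 * j) % 7 for j in range(9)] for i in range(10)]
k1 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
k2 = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
same = fused_tiled(x, k1, k2, tile=4) == layer_by_layer(x, k1, k2)
print(same)
```

In this sketch, neighbouring tiles read overlapping input patches (a halo of four rows and columns for two 3x3 layers), so the fused version trades some repeated input reads or recomputation for not storing the full intermediate feature map.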

Year of publication: 2019
Book title: 2019 26th IEEE International Conference on Electronics, Circuits and Systems, ICECS 2019
ISBN (International Standard Book Number): 978-1-7281-0996-1
Keywords: Hardware accelerators for Neural Networks; FPGA
Appears in categories: 04.1 Conference proceedings contribution

Files in this record:

File	Size	Format
ICECS_2019_08964870.pdf (restricted access; published article, publisher's version)	891.63 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1146032
