
A tile-based fused-layer approach to accelerate DCNNs on low-density FPGAs

Erdem A.;Babic D.;Silvano C.
2019-01-01

Abstract

Accelerating Deep Convolutional Neural Networks (DCNNs) on FPGAs is attracting considerable interest across a wide range of applications, such as image recognition and classification. Although memory requirements still represent a challenge for fast and efficient inference workloads, fused-layer accelerators have recently been proposed to mitigate memory bandwidth problems. In this scenario, we propose a tile-based fused-layer approach that exploits temporal and spatial locality when mapping a DCNN to a specialized low-density FPGA. In our tile-based approach, we apply a fusion depth of two convolutional layers so that the design fits into a low-density FPGA without affecting performance, while reducing the memory bandwidth. We demonstrated the effectiveness of our two-fused-layer approach by accelerating the VGG16 network on a Zynq-7020. We achieved an average speedup of 1.44x, while saving up to 61% of memory transactions with respect to a tiled-only (non-fused) baseline. A tile-based fully-fused version of VGG16 would not have been feasible to map on a low-density Zynq-7020.
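The fusion idea summarized in the abstract can be illustrated in software. The sketch below is not the paper's HLS implementation; it is a minimal, hypothetical numpy model of fusing two convolutional layers tile by tile: each output tile of the second layer is computed from a freshly recomputed intermediate tile of the first layer, so the full intermediate feature map never needs to travel to external memory. Kernel sizes, tile size, and the single-channel valid convolution are simplifying assumptions for illustration.

```python
import numpy as np

def conv2d(x, w):
    # Single-channel "valid" cross-correlation (no padding, stride 1).
    H, W = x.shape
    k = w.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k, j:j + k] * w)
    return out

def fused_two_layer(x, w1, w2, tile=4):
    # Hypothetical sketch of two-layer fusion: for each tile of the
    # layer-2 output, load only the input region that covers it through
    # both layers, compute the layer-1 intermediate tile (conceptually
    # kept in on-chip buffers), and immediately consume it in layer 2.
    k1, k2 = w1.shape[0], w2.shape[0]
    H2 = x.shape[0] - k1 - k2 + 2
    W2 = x.shape[1] - k1 - k2 + 2
    out = np.zeros((H2, W2))
    for ti in range(0, H2, tile):
        for tj in range(0, W2, tile):
            h = min(tile, H2 - ti)
            w = min(tile, W2 - tj)
            # Input halo needed by an (h, w) output tile through both kernels.
            xin = x[ti:ti + h + k1 + k2 - 2, tj:tj + w + k1 + k2 - 2]
            inter = conv2d(xin, w1)   # intermediate tile, never stored in full
            out[ti:ti + h, tj:tj + w] = conv2d(inter, w2)
    return out
```

Because overlapping halo regions of the intermediate map are recomputed per tile, fusion trades extra arithmetic for the off-chip transfers of the full intermediate feature map, which is the bandwidth saving the abstract reports.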
2019
2019 26th IEEE International Conference on Electronics, Circuits and Systems, ICECS 2019
978-1-7281-0996-1
Hardware accelerators for Neural Networks
FPGA
Files in this product:
  • ICECS_2019_08964870.pdf — Published article (Publisher's version), Adobe PDF, 891.63 kB, restricted access
Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1146032
Citations
  • Scopus: 3
  • Web of Science: 2