A tile-based fused-layer approach to accelerate DCNNs on low-density FPGAs
Erdem A.;Babic D.;Silvano C.
2019-01-01
Abstract
Accelerating Deep Convolutional Neural Networks (DCNNs) on FPGAs is attracting considerable interest across a wide range of applications such as image recognition and classification. Although memory requirements still represent a challenge for fast and efficient inference workloads, fused-layer acceleration has recently been proposed to mitigate memory bandwidth problems. In this scenario, we propose a tile-based fused-layer approach that exploits temporal and spatial locality when mapping a DCNN to a specialized low-density FPGA. In our tile-based approach, we propose a fusing depth of two convolutional layers so that the design fits into a low-density FPGA without affecting performance, while reducing the required memory bandwidth. We demonstrated the effectiveness of our two-fused-layer approach by accelerating the VGG16 network on a Zynq-7020. We achieved an average speedup of 1.44x while saving up to 61% of memory transactions with respect to a baseline represented by the tiled-only (not fused) version. A tile-based fully fused version of VGG16 could not feasibly have been mapped onto a low-density Zynq-7020.
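As a rough illustration of why the fusing depth matters on a small device, the sketch below computes how large an input tile must be held on chip to produce one output tile through a stack of fused convolutional layers. It is not taken from the paper: the function name `fused_input_tile`, the 14x14 output tile, and the 3x3 stride-1 kernels are assumptions chosen only for illustration.

```python
def fused_input_tile(out_tile, kernels, strides):
    """Side length of the square input tile needed to compute an
    out_tile x out_tile output tile through a stack of fused conv layers.

    Assumption: no padding is applied inside the tile, so every fused
    layer adds a halo of (kernel - 1) pixels around the tile.
    """
    size = out_tile
    # Walk backwards from the last fused layer to the first one.
    for k, s in zip(reversed(kernels), reversed(strides)):
        size = (size - 1) * s + k
    return size


if __name__ == "__main__":
    # Hypothetical example: two fused 3x3, stride-1 layers, 14x14 output tile.
    print(fused_input_tile(14, [3, 3], [1, 1]))
```

Under these assumptions each additional fused layer enlarges the tile that has to be buffered, which is consistent with the abstract's point that a limited fusing depth is easier to fit on a low-density FPGA than a fully fused network.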
File: ICECS_2019_08964870.pdf (published article, publisher's version; restricted access; Adobe PDF, 891.63 kB)