
Accelerating Binary and Mixed-Precision NNs Inference on STMicroelectronics Embedded NPU with Digital In-Memory-Computing

Fabrizio Indirli; Cristina Silvano
2023-01-01

Abstract

The proliferation of embedded Neural Processing Units (NPUs) is enabling the adoption of Tiny Machine Learning for numerous cognitive computing applications on the edge, where maximizing energy efficiency is key. To overcome the limitations of traditional von Neumann architectures, novel designs based on computational memories are arising. STMicroelectronics is developing an experimental low-power NPU that integrates Digital In-Memory Computing (DIMC) SRAM with a modular dataflow inference engine, capable of accelerating a wide range of DNNs. In this work, we present a 40 nm preliminary version of this architecture with DIMC-SRAM tiles capable of in-memory binary computations to dramatically increase the computational efficiency of binary layers. We performed power/performance analysis to demonstrate the advantages of this paradigm, which in our experiments achieved a TOPS/W efficiency up to 40x higher than traditional NPU implementations. We then extended the ST Neural compilation toolchain to automatically map binary and mixed-precision NNs onto the NPU, applying high-level optimizations and binding the models' binary GEMM and CONV layers to the DIMC tiles. The overall system was validated by developing three real-time applications that represent potential real-world power-constrained use cases: fan-spinning anomaly detection, keyword spotting, and face presence detection. The applications ran with a latency < 3 ms, and the DIMC subsystem achieved a peak efficiency > 100 TOPS/W for binary in-memory computations.
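The binary GEMM/CONV layers mentioned in the abstract are conventionally computed with the XNOR-popcount formulation, which is the class of operation that DIMC-SRAM tiles accelerate in memory. As a minimal illustrative sketch (not taken from the paper): with activations and weights restricted to {-1, +1} and packed as bit vectors, a dot product reduces to an XNOR followed by a population count.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two {-1, +1} vectors of length n, each packed into an
    n-bit integer (bit value 1 encodes +1, bit value 0 encodes -1).
    Illustrative only; the actual DIMC tile datapath is not shown here."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask        # 1 where the two vectors agree
    matches = bin(xnor).count("1")          # popcount of agreements
    return 2 * matches - n                  # agreements minus disagreements

# Example: a = (+1, +1, -1, +1), b = (+1, -1, -1, +1) packed LSB-first
# -> 3 agreements out of 4 -> dot product = 2
print(binary_dot(0b1011, 0b1001, 4))
```

Because the multiply-accumulate collapses to bitwise logic and a counter, each memory row can evaluate many such products in parallel, which is the source of the efficiency gains reported for binary layers.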
2023
embedded world Conference 2023 Proceedings
978-3-645-50197-2
Files in this record:
ewc23_paper.pdf — Publisher's version, Adobe PDF, 5.32 MB (restricted access)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11311/1237616