Accelerating Binary and Mixed-Precision NNs Inference on STMicroelectronics Embedded NPU with Digital In-Memory-Computing
Fabrizio Indirli; Cristina Silvano
2023-01-01
Abstract
The proliferation of embedded Neural Processing Units (NPUs) is enabling the adoption of Tiny Machine Learning for numerous cognitive computing applications on the edge, where maximizing energy efficiency is key. To overcome the limitations of traditional von Neumann architectures, novel designs based on computational memories are arising. STMicroelectronics is developing an experimental low-power NPU that integrates Digital In-Memory Computing (DIMC) SRAM with a modular dataflow inference engine, capable of accelerating a wide range of DNNs. In this work, we present a preliminary 40 nm version of this architecture with DIMC-SRAM tiles capable of in-memory binary computations, which dramatically increase the computational efficiency of binary layers. We performed power/performance analyses to demonstrate the advantages of this paradigm, which in our experiments achieved a TOPS/W efficiency up to 40x higher than traditional NPU implementations. We then extended the ST Neural compilation toolchain to automatically map binary and mixed-precision NNs onto the NPU, applying high-level optimizations and binding the models’ binary GEMM and CONV layers to the DIMC tiles. The overall system was validated by developing three real-time applications that represent potential power-constrained real-world use cases: fan-spinning anomaly detection, keyword spotting, and face presence detection. The applications ran with a latency below 3 ms, and the DIMC subsystem achieved a peak efficiency above 100 TOPS/W for binary in-memory computations.
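Binary GEMM and CONV layers like those the abstract binds to the DIMC tiles are commonly reduced to XNOR and popcount operations on bit-packed {-1, +1} vectors. The following is a minimal Python sketch of that general equivalence only; the function names (`binary_dot`, `pack`) are illustrative and do not come from the paper, and this is not ST's DIMC implementation.

```python
def pack(vec):
    """Pack a list of +1/-1 values into an integer bitmask (bit 1 encodes +1)."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two n-element {-1, +1} vectors given as packed bitmasks.

    Bits that differ contribute -1 and bits that match contribute +1,
    so the dot product equals n minus twice the number of mismatches.
    """
    mismatches = bin((a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
print(binary_dot(pack(a), pack(b), len(a)))  # → 0
```

Because the inner product collapses to a bitwise XOR (the complement of XNOR) followed by a popcount, each DIMC word line can evaluate a long binary dot product in place, which is the source of the efficiency gains reported above.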
File | Size | Format |
---|---|---|
ewc23_paper.pdf (Publisher’s version, restricted access) | 5.32 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.