RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

A Survey of Quantization Techniques in Embedded AI Toolchains

Hasanpour, Mohammad Amin; Fafoutis, Xenofon; Roveri, Manuel

2025-01-01

Abstract

Quantization has become a key method for enabling deep learning (DL) inference on resource-constrained embedded systems. As the demand for privacy-preserving, low-latency, and energy-efficient artificial intelligence (AI) increases, quantization allows models to run efficiently on edge hardware by reducing the precision of weights and activations - often with minimal impact on accuracy. This survey presents a tool-centric analysis of quantization support in twelve widely used embedded artificial intelligence (eAI) frameworks, including TensorFlow Lite, PyTorch, ONNX Runtime, and vendor-specific stacks like Qualcomm's QNN and Intel's OpenVINO. We examine how each tool implements quantization across several axes: supported workflows (post-training vs. quantization-aware training), bit-width flexibility, execution realism (simulated vs. integer kernels), and quantization granularity and schemes. Our findings reveal common patterns - such as the dominance of 8-bit uniform affine quantization - and highlight key distinctions in flexibility, deployment readiness, and hardware integration. We summarize our results in a unified comparison table to guide practitioners and researchers in selecting the most appropriate tool for their deployment needs. Finally, we discuss trends such as mixed-precision quantization and speculate on future directions for eAI tooling.
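As an illustration of the 8-bit uniform affine quantization that the abstract names as the dominant scheme, the following is a minimal sketch in plain Python. It is not taken from the paper or from any of the surveyed frameworks; the function names (`affine_qparams`, `quantize`, `dequantize`) and the choice of an unsigned 8-bit range are assumptions made for the example.

```python
def affine_qparams(x_min, x_max, num_bits=8):
    """Return (scale, zero_point) for unsigned uniform affine quantization.

    Hypothetical helper: the unsigned range [0, 2^bits - 1] is an assumption.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that real zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    if x_max == x_min:
        return 1.0, 0  # degenerate range: any scale works
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map real values to integers: q = clamp(round(x / scale) + zero_point)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate real values: x ~ scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


# Example: quantize a few weights observed in the range [-1.0, 1.0].
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = affine_qparams(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
recovered = dequantize(q_weights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, recovered))
```

For values inside the calibrated range, the round-trip error stays within about half a quantization step (`scale / 2`), which is why 8 bits often costs little accuracy. Per-channel granularity, in this sketch, would simply mean computing a separate `(scale, zero_point)` pair for each output channel.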

Year of publication: 2025

Book title: 2025 IEEE Annual Congress on Artificial Intelligence of Things (AIoT)

Appears in types: 04.1 Conference proceedings contribution

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1309044
