Quantization has become a key method for enabling deep learning (DL) inference on resource-constrained embedded systems. As the demand for privacy-preserving, low-latency, and energy-efficient artificial intelligence (AI) increases, quantization allows models to run efficiently on edge hardware by reducing the precision of weights and activations - often with minimal impact on accuracy. This survey presents a tool-centric analysis of quantization support in twelve widely used embedded artificial intelligence (eAI) frameworks, including TensorFlow Lite, PyTorch, ONNX Runtime, and vendor-specific stacks like Qualcomm's QNN and Intel's OpenVINO. We examine how each tool implements quantization across several axes: supported workflows (post-training vs. quantization-aware training), bit-width flexibility, execution realism (simulated vs. integer kernels), and quantization granularity and schemes. Our findings reveal common patterns - such as the dominance of 8-bit uniform affine quantization - and highlight key distinctions in flexibility, deployment readiness, and hardware integration. We summarize our results in a unified comparison table to guide practitioners and researchers in selecting the most appropriate tool for their deployment needs. Finally, we discuss trends such as mixed-precision quantization and speculate on future directions for eAI tooling.

A Survey of Quantization Techniques in Embedded AI Toolchains

Roveri, Manuel
2025-01-01

Abstract

Quantization has become a key method for enabling deep learning (DL) inference on resource-constrained embedded systems. As the demand for privacy-preserving, low-latency, and energy-efficient artificial intelligence (AI) increases, quantization allows models to run efficiently on edge hardware by reducing the precision of weights and activations - often with minimal impact on accuracy. This survey presents a tool-centric analysis of quantization support in twelve widely used embedded artificial intelligence (eAI) frameworks, including TensorFlow Lite, PyTorch, ONNX Runtime, and vendor-specific stacks like Qualcomm's QNN and Intel's OpenVINO. We examine how each tool implements quantization across several axes: supported workflows (post-training vs. quantization-aware training), bit-width flexibility, execution realism (simulated vs. integer kernels), and quantization granularity and schemes. Our findings reveal common patterns - such as the dominance of 8-bit uniform affine quantization - and highlight key distinctions in flexibility, deployment readiness, and hardware integration. We summarize our results in a unified comparison table to guide practitioners and researchers in selecting the most appropriate tool for their deployment needs. Finally, we discuss trends such as mixed-precision quantization and speculate on future directions for eAI tooling.
2025
2025 IEEE Annual Congress on Artificial Intelligence of Things (AIoT)
File in questo prodotto:
File Dimensione Formato  
A_Survey_of_Quantization_Techniques_in_Embedded_AI_Toolchains-2.pdf

Accesso riservato

Dimensione 1.06 MB
Formato Adobe PDF
1.06 MB Adobe PDF   Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1309044
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? ND
social impact