Accelerating K-Means: A Vectorized Approach for AI Engines & Neural Processing Units

Cabai, Eleonora; Sorrentino, Giuseppe; Santambrogio, Marco Domenico; Conficconi, Davide
2025-01-01

Abstract

K-Means is a clustering technique widely employed in AI workloads, from image processing to data mining. Given its importance, researchers have proposed different algorithms and hardware-accelerated implementations. While algorithm suitability can depend on the target use case, there is much less doubt about the architecture: FPGAs are the de facto standard, as the design can be tailored precisely to the target use case. Despite this, AI accelerators such as GPUs and Neural Processing Units (NPUs) are gaining traction. The former attain remarkable performance at the cost of low energy efficiency. The latter, instead, promise to maximize both performance and energy efficiency, but they are strongly underutilized due to the lack of a clear approach for K-Means acceleration. In the AMD NPU, for example, the main computing cores are AI Engines, which require algorithm reshaping and code optimization to harness data parallelism effectively. Thus, this research analyzes different K-Means versions to propose a vectorized algorithm that fully uses AI Engine (AIE) features. We validate our vectorized K-Means on the Versal VCK5000, using FPGAs for data movement only, in the role of the Memory Transfer Engines and Shim Tiles of NPUs, and the AI Engine for computation. This design reflects features of modern NPUs, making the validation fair. We attain up to a 59.5× speedup against the Torch library on GPUs while being comparable to, but more energy efficient than, further optimized GPU solutions.
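For readers unfamiliar with the baseline, the following is a minimal sketch of one iteration of the standard (Lloyd) K-Means algorithm in plain Python. It is not the vectorized AIE algorithm proposed in the paper, whose details the abstract does not give; the function names `squared_distance` and `kmeans_step` and the toy data are assumptions made for illustration only. The per-point assignment loop is the data-parallel part that vector hardware would typically process several points at a time.

```python
def squared_distance(p, c):
    """Squared Euclidean distance between a point and a centroid."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))


def kmeans_step(points, centroids):
    """One Lloyd iteration: assign each point, then recompute centroids.

    Illustrative baseline only (hypothetical helper, not the paper's code).
    """
    k = len(centroids)
    dim = len(centroids[0])

    # Assignment step: each point is independent of the others,
    # which is what makes this loop a natural target for vectorization.
    labels = [
        min(range(k), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

    # Update step: accumulate per-cluster sums and counts.
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, label in zip(points, labels):
        counts[label] += 1
        for d in range(dim):
            sums[label][d] += p[d]

    # An empty cluster keeps its previous centroid (one common convention).
    new_centroids = [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]
    return labels, new_centroids


# Toy data (assumed for illustration): two well-separated groups in 2D.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
labels, centroids = kmeans_step(points, centroids)
```

In practice the step is repeated until the labels stop changing or an iteration limit is reached.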
2025 35th International Conference on Field-Programmable Logic and Applications (FPL)
File in this record: KMeans_FPL.pdf (open access; Post-Print, DRAFT or Author's Accepted Manuscript-AAM; Adobe PDF, 313.92 kB)

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1311651