RE.PUBLIC@POLIMI Research publications of the Politecnico di Milano

Accelerating K-Means: A Vectorized Approach for AI Engines & Neural Processing Units

Cabai, Eleonora; Sorrentino, Giuseppe; Santambrogio, Marco Domenico; Conficconi, Davide

2025-01-01

Abstract

K-Means is a clustering technique widely employed in AI workloads, from image processing to data mining. Given its importance, researchers have proposed many algorithms and hardware-accelerated implementations. While the most suitable algorithm can depend on the target use case, there is much less doubt about the architecture: FPGAs are the de facto standard, as the design can be tailored precisely to the target use case. Despite this, AI accelerators such as GPUs and Neural Processing Units (NPUs) are gaining traction. GPUs attain remarkable performance at the cost of low energy efficiency. NPUs, instead, promise to maximize both, but they remain strongly underutilized because there is no clear approach to accelerating K-Means on them. In the AMD NPU, for example, the main computing cores are AI Engines, which require algorithm reshaping and code optimization to exploit data parallelism effectively. Thus, this research analyzes different K-Means versions to propose a vectorized algorithm that fully uses AI Engine (AIE) features. We validate our vectorized K-Means on the Versal VCK5000, using the FPGA only for data movement, in the role played by the Memory Transfer Engines and Shim Tiles of NPUs, and the AI Engines for computation. This design mirrors the features of modern NPUs, making the validation fair. We attain up to 59.5× speedup over the Torch library on GPUs while being comparable in performance to, and more energy efficient than, further optimized GPU solutions.
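For reference, the baseline Lloyd iteration that accelerated K-Means variants restructure can be sketched as below. This is a minimal scalar sketch, assuming Euclidean distance and caller-supplied initial centroids; it is not the authors' vectorized AI Engine algorithm, and every function name in it is hypothetical. The per-dimension arithmetic in the distance computation is the data-parallel work that a vectorized version would map onto SIMD lanes.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

# Hypothetical helper names: an illustrative baseline, not the paper's AIE kernel.

def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Sum of squared per-dimension differences; these independent element-wise
    # operations are what a vector unit could process several at a time.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Point], centroids: Sequence[Point]) -> List[int]:
    # Assignment step: index of the nearest centroid for every point.
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


def update(points: Sequence[Point], labels: Sequence[int],
           centroids: Sequence[Point]) -> List[Point]:
    # Update step: move each centroid to the mean of its assigned points.
    # A centroid with no assigned points keeps its previous position.
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for p, k in zip(points, labels):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += p[d]
    return [
        tuple(s / counts[k] for s in sums[k]) if counts[k] else tuple(centroids[k])
        for k in range(len(centroids))
    ]


def kmeans(points: Sequence[Point], init_centroids: Sequence[Point],
           max_iters: int = 100) -> Tuple[List[Point], List[int]]:
    # Alternate assignment and update until no centroid moves (or max_iters).
    centroids = [tuple(c) for c in init_centroids]
    for _ in range(max_iters):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], [(0, 0), (10, 10)])` separates the two groups of three points. The sketch favors readability over speed; a hardware-oriented version would instead batch the distance computations so that many points or dimensions are processed per instruction.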

Year of publication: 2025
Book title: 2025 35th International Conference on Field-Programmable Logic and Applications (FPL)
Appears in types: 04.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
KMeans_FPL.pdf (open access; Post-Print: DRAFT or Author's Accepted Manuscript, AAM)	313.92 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1311651
