Accelerating K-Means: A Vectorized Approach for AI Engines & Neural Processing Units
Cabai, Eleonora;Sorrentino, Giuseppe;Santambrogio, Marco Domenico;Conficconi, Davide
2025-01-01
Abstract
K-Means is a clustering technique widely employed in AI workloads, from image processing to data mining. Given its importance, researchers have proposed different algorithms and hardware-accelerated implementations. While algorithm suitability can depend on the target use case, there is much less doubt about the architecture: FPGAs are the de facto standard, because the design can be tailored to the target use case. Despite this, AI accelerators such as GPUs and Neural Processing Units (NPUs) are gaining traction. The former attain remarkable performance at the cost of low energy efficiency. The latter, instead, promise to maximize both, but remain strongly underutilized due to the lack of a clear approach for K-Means acceleration. In the AMD NPU, for example, the main computing cores are AI Engines, which require algorithm reshaping and code optimization to harness data parallelism effectively. Thus, this research analyzes different K-Means versions to propose a vectorized algorithm that fully uses AI Engine (AIE) features. We validate our vectorized K-Means on a Versal VCK5000, using the FPGA for data movement only, in the role of the Memory Transfer Engines and Shim Tiles of NPUs, and the AI Engines for computation. This design reflects features of modern NPUs, making the validation fair. We attain up to a 59.5× speedup against the Torch library on GPUs while being comparable to, but more energy efficient than, further optimized GPU solutions.
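As a rough illustration of what "algorithm reshaping" for data parallelism can mean, the sketch below implements the classic Lloyd iteration in plain Python with the assignment step rewritten as dot products, using the identity ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2. This is not the authors' algorithm and not AI Engine code: the function names (`assign`, `update`, `kmeans`) and the choice of passing the initial centroids explicitly are assumptions made for this example. The point is only that the inner loop becomes a multiply-accumulate pattern, which is the kind of operation vector units execute well.

```python
def assign(points, centroids):
    """Return the index of the nearest centroid for each point.

    ||x||^2 is the same for every centroid, so comparing
    ||c||^2 - 2 * (x . c) is enough to find the nearest one.
    """
    c_norms = [sum(v * v for v in c) for c in centroids]  # once per iteration
    labels = []
    for x in points:
        # One dot product per centroid: a multiply-accumulate pattern
        # that a SIMD unit could evaluate across lanes.
        scores = [cn - 2.0 * sum(a * b for a, b in zip(x, c))
                  for c, cn in zip(centroids, c_norms)]
        labels.append(min(range(len(centroids)), key=scores.__getitem__))
    return labels


def update(points, labels, centroids):
    """Recompute each centroid as the mean of its assigned points."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for x, k in zip(points, labels):
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += x[d]
    # An empty cluster keeps its previous centroid.
    return [[s / counts[k] for s in sums[k]] if counts[k] else list(centroids[k])
            for k in range(len(centroids))]


def kmeans(points, init_centroids, iters=20):
    """Lloyd's algorithm from explicit initial centroids (hypothetical API)."""
    centroids = [list(c) for c in init_centroids]
    labels = assign(points, centroids)
    for _ in range(iters):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return centroids, labels


if __name__ == "__main__":
    pts = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    cents, labs = kmeans(pts, [pts[0], pts[3]])
    print(labs)  # [0, 0, 0, 1, 1, 1]
```

Passing the initial centroids explicitly keeps the example deterministic; a real implementation would typically use random or k-means++ initialization, and K-Means can converge to a poor local optimum from a bad start.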
File: KMeans_FPL.pdf (open access; Post-Print, draft or Author’s Accepted Manuscript, AAM; size 313.92 kB; format Adobe PDF)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


