RE.PUBLIC@POLIMI: research publications of the Politecnico di Milano

Multi-GPU Greedy Scheduling Through a Polyglot Runtime

Di Dio Lavore, Ian;Di Donato, Guido Walter;Parravicini, Alberto;Sgherzi, Francesco;Bonetta, Daniele;Santambrogio, Marco Domenico

2025-01-01

Abstract

Multi-GPU architectures are increasingly being deployed in cloud data centers, but using GPUs efficiently from high-level programming languages remains a challenge. Moreover, exploiting the full capabilities of multi-GPU systems is an arduous task due to the complex interconnection topology between available accelerators and the variety of inter-GPU communication patterns exhibited by different workloads. This work introduces a novel scheduler for multi-task GPU computations that provides transparent asynchronous execution on multi-GPU systems without requiring prior information about the program dependencies or the underlying system architecture. It integrates with the polyglot GraalVM ecosystem and is therefore available for multiple high-level languages, providing a general framework that can significantly lower the barriers to entry to multi-GPU acceleration. We validate our work on representative workloads to investigate scalability and inter-GPU communication. Experimental results show that our scheduler automatically achieves 80-90% of the peak performance of hand-optimized CUDA host code on Volta and Ampere multi-GPU systems.
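The abstract does not describe the scheduling policy itself, so the following is only a rough sketch of what greedy multi-GPU scheduling without prior dependency information could look like. It assumes, hypothetically, that dependencies are inferred from kernel arguments at launch time and that each kernel goes to the device needing the fewest bytes transferred. The class `GreedyScheduler`, its `submit` method, and the cost model are illustrative names and choices, not the paper's implementation or any GraalVM API.

```python
class GreedyScheduler:
    """Toy greedy scheduler. Illustrative assumption, not the paper's algorithm."""

    def __init__(self, num_gpus):
        self.num_gpus = num_gpus
        self.location = {}          # array name -> set of devices with a valid copy
        self.last_writer = {}       # array name -> id of the last task that wrote it
        self.load = [0] * num_gpus  # bytes assigned per device, used to break ties
        self.tasks = []             # names of submitted tasks, index = task id

    def submit(self, name, reads, writes, sizes):
        """Schedule one kernel launch.

        Returns (device, dependencies, bytes_moved). Only read-after-write and
        write-after-write dependencies are tracked in this simplified sketch.
        """
        task_id = len(self.tasks)
        args = list(reads) + [w for w in writes if w not in reads]

        # Dependencies are discovered at launch time from the arguments alone.
        deps = sorted({self.last_writer[a] for a in args if a in self.last_writer})

        # Bytes that would have to be copied to a device before running there.
        def cost(dev):
            return sum(sizes[a] for a in reads
                       if dev not in self.location.get(a, set()))

        # Greedy choice: least data movement, then least loaded, then lowest id.
        device = min(range(self.num_gpus),
                     key=lambda d: (cost(d), self.load[d], d))
        moved = cost(device)

        for a in reads:
            self.location.setdefault(a, set()).add(device)
        for a in writes:
            self.location[a] = {device}   # copies on other devices become stale
            self.last_writer[a] = task_id

        self.load[device] += sum(sizes[a] for a in args)
        self.tasks.append(name)
        return device, deps, moved
```

In this sketch, two independent kernels on different arrays land on different devices, while a later kernel that consumes both outputs is placed where less data has to move. A real scheduler would also need to account for the interconnection topology mentioned in the abstract, which this cost model ignores.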

Year of publication: 2025
Book title: CF '25: Proceedings of the 22nd ACM International Conference on Computing Frontiers
Publication type: 04.1 Contribution in conference proceedings

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1294886
