On the Effectiveness of Unified Memory in Multi-GPU Collective Communication

Di Dio Lavore, Ian; Santambrogio, Marco
2025-01-01

Abstract

Modern supercomputers are becoming increasingly dense with accelerators. Industry leaders offer multi-GPU architectures with high interconnection bandwidth between devices to match the requirements of modern workloads. While these technologies advance, it is up to the programmer to exploit them effectively. Recognizing this burden, multiple abstractions have been built. We focus on the NVIDIA Collective Communication Library (NCCL) and Unified Memory (UM). The former provides MPI-like directives integrated within the GPU runtime, achieving lower latency and higher bandwidth than previous approaches. The latter simplifies the programming model by offering a unified virtual address space. Moreover, it enables memory oversubscription, drastically reducing the effort needed to handle larger problems without completely restructuring the codebase. This work provides the first joint analysis of NCCL and UM, from single-node multi-GPU architectures to a production supercomputer. We explore all the available collective communication directives with respect to their power requirements and overall throughput. Moreover, we study the effects of various hyperparameters, e.g., message size, oversubscription level, and memory advice, on the achievable performance. Our findings show that using UM incurs a negligible increase in energy consumption; moreover, in distributed settings, other limiting factors, such as network bottlenecks, outweigh the overhead introduced by UM's page-eviction mechanisms.
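To illustrate the combination studied in the paper, the following is a minimal single-node, single-process sketch (not taken from the paper) that issues a NCCL AllReduce over Unified Memory buffers allocated with cudaMallocManaged. The choice of ncclAllReduce as the representative collective, the buffer size, and the cudaMemAdviseSetPreferredLocation hint are illustrative assumptions; error checking is omitted for brevity.

```c
#include <cuda_runtime.h>
#include <nccl.h>
#include <stdlib.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  /* One NCCL communicator per local GPU (single-process, single-node setup). */
  ncclComm_t *comms = malloc(ndev * sizeof(ncclComm_t));
  ncclCommInitAll(comms, ndev, NULL);          /* NULL devlist: use devices 0..ndev-1 */

  size_t count = (size_t)1 << 24;              /* assumed message size: 16M floats per GPU */
  float **send = malloc(ndev * sizeof(float *));
  float **recv = malloc(ndev * sizeof(float *));
  cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    /* Unified Memory buffers: one virtual address space, pages migrated on demand. */
    cudaMallocManaged((void **)&send[i], count * sizeof(float), cudaMemAttachGlobal);
    cudaMallocManaged((void **)&recv[i], count * sizeof(float), cudaMemAttachGlobal);
    /* Illustrative memory advice: prefer keeping each buffer resident on its GPU. */
    cudaMemAdvise(send[i], count * sizeof(float), cudaMemAdviseSetPreferredLocation, i);
    cudaMemAdvise(recv[i], count * sizeof(float), cudaMemAdviseSetPreferredLocation, i);
    cudaStreamCreate(&streams[i]);
  }

  /* Grouped AllReduce across all local GPUs, operating directly on UM buffers. */
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(send[i]);
    cudaFree(recv[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  free(send); free(recv); free(streams); free(comms);
  return 0;
}
```

In this sketch, oversubscription experiments would simply increase `count` beyond the per-GPU memory capacity, letting the UM runtime page data in and out during the collective.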
2025 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2025
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1294885
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science (ISI): n/a