On the Effectiveness of Unified Memory in Multi-GPU Collective Communication

Di Dio Lavore, Ian; Santambrogio, Marco
2025-01-01

Abstract

Modern supercomputers are becoming increasingly dense with accelerators. Industry leaders offer multi-GPU architectures with high interconnection bandwidth between devices to match the requirements of modern workloads. While these technologies advance, it is up to the programmer to exploit them effectively. Recognizing this burden, multiple abstractions have been built. We focus on the NVIDIA Collective Communication Library (NCCL) and Unified Memory (UM). The former provides MPI-like directives integrated within the GPU runtime, achieving lower latency and higher bandwidth than previous approaches. The latter simplifies the programming model by offering a unified virtual address space. Moreover, it enables memory oversubscription, drastically reducing the effort needed to handle larger problems without completely restructuring the codebase. This work provides the first joint analysis of NCCL and UM, from single-node multi-GPU architectures to a production supercomputer. We explore all the available collective communication directives with respect to their power requirements and overall throughput. Moreover, we study the effects of various hyperparameters, e.g., message size, oversubscription level, and memory advice, on the achievable performance. Our findings show that using UM incurs a negligible increase in energy consumption; moreover, in distributed settings, other limiting factors, such as network bottlenecks, outweigh the overhead introduced by UM's page-eviction mechanisms.
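To illustrate the combination studied in the paper, the following is a minimal single-node, single-process sketch (not taken from the paper) that issues a NCCL AllReduce over Unified Memory buffers allocated with cudaMallocManaged. The choice of ncclAllReduce as the representative collective, the buffer size, and the cudaMemAdviseSetPreferredLocation hint are illustrative assumptions; error checking is omitted for brevity.

```c
#include <cuda_runtime.h>
#include <nccl.h>
#include <stdlib.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  /* One NCCL communicator per local GPU (single-process, single-node setup). */
  ncclComm_t *comms = malloc(ndev * sizeof(ncclComm_t));
  ncclCommInitAll(comms, ndev, NULL);          /* NULL devlist: use devices 0..ndev-1 */

  size_t count = (size_t)1 << 24;              /* assumed message size: 16M floats per GPU */
  float **send = malloc(ndev * sizeof(float *));
  float **recv = malloc(ndev * sizeof(float *));
  cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    /* Unified Memory buffers: one virtual address space, pages migrated on demand. */
    cudaMallocManaged((void **)&send[i], count * sizeof(float), cudaMemAttachGlobal);
    cudaMallocManaged((void **)&recv[i], count * sizeof(float), cudaMemAttachGlobal);
    /* Illustrative memory advice: prefer keeping each buffer resident on its GPU. */
    cudaMemAdvise(send[i], count * sizeof(float), cudaMemAdviseSetPreferredLocation, i);
    cudaMemAdvise(recv[i], count * sizeof(float), cudaMemAdviseSetPreferredLocation, i);
    cudaStreamCreate(&streams[i]);
  }

  /* Grouped AllReduce across all local GPUs, operating directly on UM buffers. */
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(send[i]);
    cudaFree(recv[i]);
    cudaStreamDestroy(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  free(send); free(recv); free(streams); free(comms);
  return 0;
}
```

In this sketch, oversubscription experiments would simply increase `count` beyond the per-GPU memory capacity, letting the UM runtime page data in and out during the collective.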
2025 33rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2025
Files in this record:
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1294885
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science (ISI): n/a