Modern commodity and High-Performance Computing (HPC) systems are evolving with complex CPU architectures. These architectures now feature higher core and NUMA domain counts and implement features such as hyperthreading. When considering significant differences in hardware configurations, library availability, and hardware-tailored system/software stacks, which could substantially vary from one system to another, performance portability is hard to achieve. Throughout the years, this trend resulted in an increasingly high burden on application developers to fine-tune their workloads for each architecture. This work explores how hardware-dependent aspects such as locality/process/thread affinity affect performance in modern CPU architectures. We focus our study on the Global Memory and Threading (GMT) distributed runtime system as a representative of Partitioned Global Address Space (PGAS) software stacks commonly adopted for productivity. In particular, to appreciate performance implications, we evaluate GMT’s thread affinity policies, and, introduce two new ones which exploit architectural awareness. Finally, we explore alternative NUMA configurations via different process bindings and perform a scalability study on three HPC clusters with varying CPU architectures and NUMA layouts. Our analysis indicates that more complex architectures are more affected by affinity and binding policies and highlights the importance of setting proper runtime configurations to achieve superior performance.

Exploring Architectural-Aware Affinity Policies in Modern HPC Runtimes

Di Dio Lavore, Ian;Santambrogio, Marco
2024-01-01

Abstract

Modern commodity and High-Performance Computing (HPC) systems are evolving with complex CPU architectures. These architectures now feature higher core and NUMA domain counts and implement features such as hyperthreading. When considering significant differences in hardware configurations, library availability, and hardware-tailored system/software stacks, which could substantially vary from one system to another, performance portability is hard to achieve. Throughout the years, this trend resulted in an increasingly high burden on application developers to fine-tune their workloads for each architecture. This work explores how hardware-dependent aspects such as locality/process/thread affinity affect performance in modern CPU architectures. We focus our study on the Global Memory and Threading (GMT) distributed runtime system as a representative of Partitioned Global Address Space (PGAS) software stacks commonly adopted for productivity. In particular, to appreciate performance implications, we evaluate GMT’s thread affinity policies, and, introduce two new ones which exploit architectural awareness. Finally, we explore alternative NUMA configurations via different process bindings and perform a scalability study on three HPC clusters with varying CPU architectures and NUMA layouts. Our analysis indicates that more complex architectures are more affected by affinity and binding policies and highlights the importance of setting proper runtime configurations to achieve superior performance.
2024
Practice and Experience in Advanced Research Computing 2024: Human Powered Computing
9798400704192
Distributed Runtime Systems; High-Performance Computing; Process Binding; Thread Affinity
File in questo prodotto:
File Dimensione Formato  
gmt_eval_PEARC24_camera_ready.pdf

accesso aperto

: Post-Print (DRAFT o Author’s Accepted Manuscript-AAM)
Dimensione 604.66 kB
Formato Adobe PDF
604.66 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1270782
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact