Reliability-oriented resource management for High-Performance Computing

Massari, Giuseppe; Peta, Miriam; Campi, Alessandro; Reghenzani, Federico; Terraneo, Federico; Agosta, Giovanni; Fornaciari, William
2023-01-01

Abstract

Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures threaten large-scale applications, for which even a single run may incur significant energy and billing costs. Currently, application developers need to address reliability explicitly by integrating application-specific checkpoint/restore mechanisms. However, the application alone cannot exploit system knowledge, whereas system-wide resource management systems can. In this paper, we propose a reliability-oriented policy that can significantly increase component reliability by combining the use of checkpoint/restore mechanisms with proactive resource management policies.
2023
Reliability, HPC, Distributed systems, Resource management, Software simulators, Thermal management
Files in this item:
2023_RECIPE_SusCom.pdf (open access; Post-Print, DRAFT or Author's Accepted Manuscript-AAM; 2.71 MB; Adobe PDF)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1236783