Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.

Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

Cilardo, Alessandro;Massari, Giuseppe;Reghenzani, Federico;Fornaciari, William;
2020-01-01

Abstract

Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.
2020
reliability, exascale, performance, power, energy, fault, design space ezploration, run-time management, mixed criticality
File in questo prodotto:
File Dimensione Formato  
csur2020.pdf

Accesso riservato

Descrizione: versione pubblicata
: Publisher’s version
Dimensione 1.41 MB
Formato Adobe PDF
1.41 MB Adobe PDF   Visualizza/Apri
11311-1147010_Fornaciari.pdf

accesso aperto

: Post-Print (DRAFT o Author’s Accepted Manuscript-AAM)
Dimensione 660.63 kB
Formato Adobe PDF
660.63 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1147010
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 14
  • ???jsp.display-item.citation.isi??? 9
social impact