RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computation that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximizing hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of the application latency and hardware re-liability obtained respectively through timing analysis and modeling thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case

The RECIPE Approach to Challenges in Deeply Heterogeneous High Performance Systems

Agosta, Giovanni;Fornaciari, William;Cilardo, Alessandro;Massari, Giuseppe;
2020-01-01

Abstract

RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computation that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximizing hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. We address it through hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of the application latency and hardware re-liability obtained respectively through timing analysis and modeling thermal properties and mean-time-to-failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case
2020
HPC, Heterogeneous computing, Run-time management
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S0141933120303525-main.pdf

Open Access dal 01/10/2021

Descrizione: pre proof
: Post-Print (DRAFT o Author’s Accepted Manuscript-AAM)
Dimensione 3.68 MB
Formato Adobe PDF
3.68 MB Adobe PDF Visualizza/Apri
1-s2.0-S0141933120303525-main.pdf

Accesso riservato

Descrizione: versione pubblicata
: Publisher’s version
Dimensione 3.06 MB
Formato Adobe PDF
3.06 MB Adobe PDF   Visualizza/Apri
RECIPE_MICPRO_Preprint.pdf

accesso aperto

Descrizione: camera ready
: Pre-Print (o Pre-Refereeing)
Dimensione 2.84 MB
Formato Adobe PDF
2.84 MB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1141354
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 11
  • ???jsp.display-item.citation.isi??? 10
social impact