RE.PUBLIC@POLIMI: research publications of the Politecnico di Milano

Reliability-oriented resource management for High-Performance Computing

Massari, Giuseppe; Peta, Miriam; Campi, Alessandro; Reghenzani, Federico; Terraneo, Federico; Agosta, Giovanni; Fornaciari, William; Ciesielski, Sebastian; Kulczewski, Michal; Piatek, Wojciech

2023-01-01

Abstract

Reliability is an increasingly pressing issue for High-Performance Computing systems, as failures are a threat to large-scale applications, for which even a single run may incur significant energy and billing costs. Currently, application developers need to address reliability explicitly, by integrating application-specific checkpoint/restore mechanisms. However, the application alone cannot exploit system knowledge, whereas system-wide resource management systems can. In this paper, we propose a reliability-oriented policy that can significantly increase component reliability by combining the exploitation of checkpoint/restore mechanisms with proactive resource management policies.

Year of publication: 2023
Journal title: SUSTAINABLE COMPUTING
Keywords: Reliability, HPC, Distributed systems, Resource management, Software simulators, Thermal management
Publication type: 01.1 Journal article

Files in this item:

File	Size	Format
2023_RECIPE_SusCom.pdf (open access; Post-Print: DRAFT or Author's Accepted Manuscript, AAM)	2.71 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1236783
