Resiliency in numerical algorithm design for extreme scale simulations

Agullo, E.; Altenbernd, M.; Anzt, H.; Bautista-Gomez, L.; Benacchio, T.; Bonaventura, L.; Bungartz, H. -J.; Chatterjee, S.; Ciorba, F. M.; DeBardeleben, N.; Drzisga, D.; Eibl, S.; Engelmann, C.; Gansterer, W. N.; Giraud, L.; Goddeke, D.; Heisig, M.; Jezequel, F.; Kohl, N.; Li, X. S.; Lion, R.; Mehl, M.; Mycek, P.; Obersteiner, M.; Quintana-Orti, E. S.; Rizzi, F.; Rude, U.; Schulz, M.; Fung, F.; Speck, R.; Stals, L.; Teranishi, K.; Thibault, S.; Thonnes, D.; Wagner, A.; Wohlmuth, B.

doi:10.1177/10943420211055188

This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’, held March 1–6, 2020, at Schloss Dagstuhl and attended by all the authors. Advanced supercomputing is characterized by very high computation speeds that come at the cost of an enormous amount of resources. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements?
While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
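
The energy, cost and operation-count figures quoted in the abstract can be checked with a short back-of-envelope calculation. The sketch below is only illustrative: the 20 MW and 48 h values come from the abstract, while the sustained rate of 1 exaflop/s and the electricity price of 0.10 Euro per kWh are assumptions chosen to match the abstract's orders of magnitude (about a million kWh, about 100k Euro, on the order of 10^23 floating-point operations). The function name run_estimate is hypothetical.

```python
# Back-of-envelope check of the abstract's figures (illustrative sketch).
POWER_MW = 20.0            # system power draw, from the abstract
RUNTIME_H = 48.0           # job duration in hours, from the abstract
FLOP_RATE = 1e18           # ASSUMPTION: sustained 1 exaflop/s
PRICE_EUR_PER_KWH = 0.10   # ASSUMPTION: price implied by ~100k Euro per ~1e6 kWh


def run_estimate(power_mw, runtime_h, flop_rate, price_eur_per_kwh):
    """Return (energy in kWh, energy cost in Euro, total floating-point operations)."""
    energy_kwh = power_mw * 1000.0 * runtime_h   # MW -> kW, times hours
    cost_eur = energy_kwh * price_eur_per_kwh
    total_flop = flop_rate * runtime_h * 3600.0  # hours -> seconds
    return energy_kwh, cost_eur, total_flop


energy_kwh, cost_eur, total_flop = run_estimate(
    POWER_MW, RUNTIME_H, FLOP_RATE, PRICE_EUR_PER_KWH
)
print(f"energy: {energy_kwh:.3g} kWh")
print(f"cost:   {cost_eur:.3g} Euro")
print(f"work:   {total_flop:.3g} floating-point operations")
```

Under these assumptions the run uses 960,000 kWh, costs roughly 96,000 Euro, and performs about 1.7 x 10^23 operations, which agrees with the rounded values in the abstract.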

2022-01-01

Year of publication: 2022
Journal title: International Journal of High Performance Computing Applications
Keywords: fault tolerance; numerical algorithms; parallel computer architecture; resilience
Appears in categories: 01.1 Journal article

Files in this record:

File	Access	Version	Size	Format
agullo_bonaventura_etal_ijhpc_2022.pdf	Restricted access	Publisher’s version	1.08 MB	Adobe PDF
11311-1202623_Bonaventura.pdf	Open access	Post-print (draft or Author’s Accepted Manuscript, AAM)	1.13 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1202623

RE.PUBLIC@POLIMI, research publications of the Politecnico di Milano
