RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

Nowadays, many enterprises commit to the extraction of actionable knowledge from huge datasets as part of their core business activities. Applications belong to very different domains such as fraud detection or one-to-one marketing, and encompass business analytics and support to decision making in both private and public sectors. In these scenarios, a central place is held by the MapReduce framework and in particular its open source implementation, Apache Hadoop. In such environments, new challenges arise in the area of jobs performance prediction, with the needs to provide Service Level Agreement guarantees to the enduser and to avoid waste of computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, able to estimate job performance under a number of scenarios of interest, including also unreliable resources. The accuracy of our models is evaluated by considering the TPC-DS industry benchmark running experiments on Amazon EC2 and the CINECA Italian supercomputing center. The results have shown that the average accuracy we can achieve is in the range 9–14%.

Modeling performance of Hadoop applications: A journey from queueing networks to stochastic well formed nets

ARDAGNA, DANILO;Bernardi, Simona;GIANNITI, EUGENIO;Aliabadi, Soroush Karimian;PEREZ PALACIN, DIEGO CARMELO;Requeno, José Ignacio

2016-01-01

Abstract

Nowadays, many enterprises commit to the extraction of actionable knowledge from huge datasets as part of their core business activities. Applications belong to very different domains such as fraud detection or one-to-one marketing, and encompass business analytics and support to decision making in both private and public sectors. In these scenarios, a central place is held by the MapReduce framework and in particular its open source implementation, Apache Hadoop. In such environments, new challenges arise in the area of jobs performance prediction, with the needs to provide Service Level Agreement guarantees to the enduser and to avoid waste of computational resources. In this paper we provide performance analysis models to estimate MapReduce job execution times in Hadoop clusters governed by the YARN Capacity Scheduler. We propose models of increasing complexity and accuracy, ranging from queueing networks to stochastic well formed nets, able to estimate job performance under a number of scenarios of interest, including also unreliable resources. The accuracy of our models is evaluated by considering the TPC-DS industry benchmark running experiments on Amazon EC2 and the CINECA Italian supercomputing center. The results have shown that the average accuracy we can achieve is in the range 9–14%.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2016
			
	Titolo del libro
	
				ALGORITHMS AND ARCHITECTURES FOR PARALLEL PROCESSING, ICA3PP 2016
			
	Titolo della collana
	
				LECTURE NOTES IN COMPUTER SCIENCE
			
	ISBN (International Standard Book Number)
	
				9783319495828
9783319495828
			
	Parole chiave
	
				MapReduce; Performance models; Theoretical Computer Science; Computer Science (all)
			
	Appare nelle tipologie:
	
				04.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
ICA3PP16-ZAR.pdf accesso aperto Descrizione: Articolo : Post-Print (DRAFT o Author’s Accepted Manuscript-AAM) Dimensione 470.16 kB Formato Adobe PDF Visualizza/Apri	470.16 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1007275

Citazioni

ND

27

20

ND

social impact