RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

Analytical composite performance models for Big Data applications

S. Karimian-Aliabadi; D. Ardagna; R. Entezari-Maleki; E. Gianniti; A. Movaghar

2019-01-01

Abstract

In the era of Big Data, the digital industry faces massive growth in data size and in the development of data-intensive software, and more and more companies are adopting new frameworks and paradigms capable of handling data at scale. The MapReduce (MR) paradigm and its implementation framework, Hadoop, are among the most widely referenced, and they form the basis for later and more advanced frameworks such as Tez and Spark. Accurate prediction of the execution time of a Big Data application helps improve design-time decisions, reduces over-allocation charges, and assists budget management. In this regard, we propose analytical models based on Stochastic Activity Networks (SANs) to accurately model the execution of MR, Tez, and Spark applications in Hadoop environments governed by the YARN Capacity scheduler. We evaluate the accuracy of the proposed models on the TPC-DS industry benchmark across different configurations. Results obtained by numerically solving the analytical SAN models show an average error of 6% in estimating the execution time of an application, compared with data gathered from experiments; moreover, the model evaluation time is lower than the simulation time of state-of-the-art solutions.

Year of publication: 2019

Journal title: JOURNAL OF NETWORK AND COMPUTER APPLICATIONS

Keywords: Big Data, MapReduce, Apache Spark, Performance Evaluation, Stochastic Activity Network

Appears in types: 01.1 Journal Article

Files in this item:

File	Size	Format
JNCA.pdf (open access; description: Main article, Pre-Print (or Pre-Refereeing))	2.71 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1090447
