RE.PUBLIC@POLIMI research publications of the Politecnico di Milano

Performance Prediction for Data-driven Workflows on Apache Spark

Andrea Gulino; Arif Canakoglu; Stefano Ceri; Danilo Ardagna

2020-01-01

Abstract

Spark is an in-memory framework for implementing distributed applications of various types. Predicting the execution time of Spark applications is an important but challenging problem that several studies have tackled in the past few years; most of them achieve good prediction accuracy on simple applications (e.g., known machine learning (ML) algorithms or SQL-based applications). In this work, we consider complex data-driven workflow applications, in which the execution and data flow can be modeled by Directed Acyclic Graphs (DAGs). Workflows can be made of an arbitrary combination of known tasks, each applying a set of Spark operations to its input data. By adopting a hybrid approach that combines analytical and ML models trained on small DAGs, we can predict, with good accuracy, the execution time of unseen workflows of higher complexity and size. We validate our approach through extensive experiments on real-world complex applications, comparing different ML models and choices of feature sets.

Year of publication	2020
Book title	2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
ISBN (International Standard Book Number)	978-1-7281-9239-0
Keywords	performance prediction, workflow applications, Spark, machine learning
Appears in types	04.1 Conference proceedings paper

Files in this item:

File	Size	Format
11311-1158359_Gulino.pdf (open access; Post-Print: draft or Author's Accepted Manuscript, AAM)	2.17 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1158359
