Performance Models for Distributed Deep Learning Training Jobs on Ray

Federica Filippini; Danilo Ardagna
In press

Abstract

Deep Learning applications are pervasive today, and efficient strategies are designed to reduce the computational time and resource demand of the training process. The Distributed Deep Learning (DDL) paradigm yields a significant speed-up by partitioning the training into multiple parallel tasks. The Ray framework supports DDL applications that exploit data parallelism, enhancing scalability with minimal user effort. This work evaluates the performance of DDL training applications by profiling their execution on a Ray cluster and developing Machine Learning-based models to predict the training time when the dataset size, the number of parallel workers, and the amount of computational resources change. Such performance-prediction models are crucial to forecast computational resource usage and costs in Cloud environments. Experimental results show that our models achieve average prediction errors between 3% and 15% for both interpolation and extrapolation, demonstrating their applicability to unforeseen scenarios.
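The abstract refers to data-parallel DDL training on Ray. A minimal sketch of such a job is shown below, assuming Ray Train's TorchTrainer API (Ray 2.x) with an illustrative toy model and synthetic batches; the paper's actual workloads and training loops are not reproduced here.

import torch
import torch.nn as nn
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

def train_loop_per_worker(config):
    # Each Ray worker runs this loop on its own data shard; prepare_model
    # wraps the model for distributed data-parallel training.
    model = train.torch.prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])
    for _ in range(config["epochs"]):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder batch
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-2, "epochs": 3},
    # num_workers is the degree of data parallelism whose effect on
    # training time the paper models.
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
result = trainer.fit()

The record does not specify the performance models themselves; the following is a hedged sketch of the prediction step with scikit-learn, mapping profiled job features (dataset size, number of Ray workers, cores per worker, all illustrative choices) to measured training time, and scoring with MAPE to mirror the 3-15% average errors reported above. All numbers are invented placeholders, not the paper's data.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# One row per profiled run: (dataset size, Ray workers, cores per worker).
# Placeholder values for illustration only.
X = np.array([
    [10_000, 1, 4], [10_000, 2, 4], [20_000, 2, 8], [40_000, 4, 8],
    [40_000, 8, 8], [80_000, 4, 16], [80_000, 8, 16], [160_000, 8, 16],
])
y = np.array([310.0, 170.0, 180.0, 190.0, 110.0, 205.0, 115.0, 230.0])  # seconds

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
mape = mean_absolute_percentage_error(y_te, model.predict(X_te))
print(f"Held-out MAPE: {mape:.1%}")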
Published in: 2023 49th Euromicro Conference on Software Engineering and Advanced Applications (SEAA)
Keywords: Distributed training, Performance models, Ray
Files in this record:
IBM_FFilippini.pdf: Pre-print (pre-refereeing), open access, Adobe PDF, 11.84 MB
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1256108