RE.PUBLIC@POLIMI research publications of the Politecnico di Milano

Performance Prediction of Deep Learning Applications Training in GPU as a Service Systems

Marco Lattuada; Eugenio Gianniti; Danilo Ardagna; Li Zhang

2022-01-01

Abstract

Data analysts predict that the GPU as a Service (GPUaaS) market will grow from US$700 million in 2019 to US$7 billion in 2025, a compound annual growth rate of over 38%, to support 3D models, animated video processing, and gaming. GPUaaS adoption will also be boosted by the use of graphics processing units (GPUs) for deep learning (DL) model training. The main cloud providers already include GPU-based virtual machines in their catalogs, pre-installed with popular DL frameworks (such as Torch, PyTorch, TensorFlow, and Caffe) that simplify DL model programming. Motivated by these considerations, this paper studies GPU-deployed neural networks (NNs) and addresses performance prediction, particularly of NN training times. The proposed approach is based on machine learning and exploits two main sets of features: one describes the network architecture and the hyper-parameters, and the other describes the hardware characteristics of the target deployment. These data are used to learn multiple linear regression models. Combined with an established feature selection technique, the models become accurate prediction tools, with average errors below 11%. An extensive experimental campaign on both public and in-house private cloud deployments considers popular deep NNs used for image classification and speech transcription. It shows that prediction errors remain small even when extrapolating outside the range spanned by the input data. This has important implications for the models' applicability. They can be used to investigate the performance impact of different GPUaaS deployments or hardware upgrades without running experiments on the specific target device. They can also evaluate how training time changes when the number of inner modules in the deep neural network varies.
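The abstract describes the prediction pipeline only at a high level. Features describing the network and the target hardware feed multiple linear regression models, and a feature selection technique prunes those features. The following is a minimal, hypothetical sketch of that idea using only the Python standard library. The abstract does not name the selection technique, so greedy forward selection driven by validation mean absolute percentage error (MAPE) is used here as a stand-in. The feature names (`batch_size`, `num_layers`, `gpu_memory_gb`), the synthetic training times, and every helper function are assumptions for illustration and are not taken from the paper.

```python
"""Hypothetical sketch: multiple linear regression with greedy forward
feature selection for predicting NN training time (stdlib only)."""
from typing import List, Sequence, Tuple

Row = Sequence[float]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ w = b with Gaussian elimination and partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        acc = sum(m[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (m[r][n] - acc) / m[r][r]
    return w


def design(rows: Sequence[Row], cols: List[int]) -> List[List[float]]:
    """Design matrix: an intercept column followed by the chosen features."""
    return [[1.0] + [float(r[c]) for c in cols] for r in rows]


def fit_ols(rows: Sequence[Row], y: Sequence[float], cols: List[int]) -> List[float]:
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    x = design(rows, cols)
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * yi for xi, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)


def predict(w: List[float], rows: Sequence[Row], cols: List[int]) -> List[float]:
    """Apply the fitted coefficients (intercept first) to each row."""
    return [sum(wi * xi for wi, xi in zip(w, xr)) for xr in design(rows, cols)]


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def forward_selection(train, y_train, valid, y_valid, n_features: int,
                      max_features: int, min_gain: float = 1e-9) -> Tuple[List[int], float]:
    """Greedily add the feature that most lowers validation MAPE; stop when
    no candidate improves it by at least min_gain percentage points."""
    selected: List[int] = []
    best_err = float("inf")
    while len(selected) < max_features:
        candidates = []
        for c in range(n_features):
            if c not in selected:
                w = fit_ols(train, y_train, selected + [c])
                candidates.append((mape(y_valid, predict(w, valid, selected + [c])), c))
        if not candidates:
            break
        err, c = min(candidates)
        if best_err - err < min_gain:
            break
        selected.append(c)
        best_err = err
    return selected, best_err


# Hypothetical features and synthetic times: 0.5*batch + 2*layers + 10 seconds.
FEATURES = ["batch_size", "num_layers", "gpu_memory_gb"]
TRAIN = [(32, 10, 24), (64, 18, 16), (128, 34, 32), (16, 50, 12),
         (256, 101, 32), (64, 152, 12), (32, 22, 16), (128, 8, 24)]
VALID = [(48, 20, 24), (96, 60, 16), (200, 40, 32), (80, 120, 12)]


def synthetic_time(row: Row) -> float:
    """Synthetic ground truth; gpu_memory_gb deliberately has no effect."""
    return 0.5 * row[0] + 2.0 * row[1] + 10.0


Y_TRAIN = [synthetic_time(r) for r in TRAIN]
Y_VALID = [synthetic_time(r) for r in VALID]

if __name__ == "__main__":
    cols, err = forward_selection(TRAIN, Y_TRAIN, VALID, Y_VALID, len(FEATURES), len(FEATURES))
    print("selected:", [FEATURES[c] for c in cols], f"validation MAPE: {err:.2e}%")
    print("coefficients (intercept first):", fit_ols(TRAIN, Y_TRAIN, cols))
```

The synthetic target is an exact linear function of `batch_size` and `num_layers`. The selection should therefore keep those two features and stop before adding `gpu_memory_gb`, because that feature no longer lowers the validation error. With real measurements, one would also hold out runs outside the training range to check extrapolation.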

Year of publication: 2022
Journal: CLUSTER COMPUTING
Keywords: GPU; Performance models; Machine learning
Publication type: 01.1 Journal article

Files in this item:

File	Size	Format
Clus.pdf (open access: Pre-Print or Pre-Refereeing version)	1.3 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1191540
