A Stochastic Approach for Scheduling AI Training Jobs in GPU-based Systems

Federica Filippini; Jonatha Anselmi; Danilo Ardagna
2024-01-01

Abstract

Deep Learning (DL) methods currently address a variety of complex tasks. GPUs significantly accelerate the training process, expanding the range of problems that can be solved in a reasonable computing time. As a consequence, the demand for high-performance GPU-based cloud servers has increased dramatically, requiring Cloud Service Providers (CSPs) to adopt effective resource management strategies. In this work, we optimize the scheduling of DL training jobs from the perspective of a CSP running a data center, efficiently selecting resources for the execution of each job in order to minimize the average energy consumption. We develop a Mixed-Integer Non-Linear Programming (MINLP) formulation to model the problem, and a heuristic STochastic Scheduler (STS) that, exploiting the probability distribution of early termination, determines how to vary the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the jobs' due dates. The results of an extensive experimental campaign show that STS achieves significantly better results than other methods in the literature, effectively avoiding due-date violations and yielding an average total cost reduction between 38% and 80%. We also demonstrate the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluate the effectiveness of GPU sharing, that is, running multiple jobs on a single GPU. The results demonstrate that, depending on the workload and GPU memory, this further reduces the energy cost by 17-29% on average.
2024
Deep Learning, GPU cluster, Scheduling, Average energy consumption minimization, GPU sharing
File: 49889132_File000000_1223348787 2.pdf (open access; Pre-Print / Pre-Refereeing version; Adobe PDF, 1.43 MB)

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1256297