Hierarchical Scheduling in on-demand GPU-as-a-Service Systems

Federica Filippini;Marco Lattuada;Danilo Ardagna;Edoardo Amaldi
2020-01-01

Abstract

Deep learning (DL) methods have recently gained popularity. Training this class of models is, however, computationally intensive, and GPUs are frequently used to boost performance. Although the cost of GPU-based systems is gradually decreasing due to high demand, it remains prohibitive: in public clouds, the per-time-unit price of GPU-powered virtual machines (VMs) is 5-8x higher than that of CPU-only VMs. While the cloud remains the most cost-effective and flexible deployment option, operation costs can be reduced, in large settings, by rightsizing and sharing resources among multiple processes. This work addresses the online problem of joint capacity planning and job scheduling with due dates, and proposes alternative matheuristic solution methods. Our objective is to optimize operation costs by: i) rightsizing the VM capacities at each node, ii) partitioning the set of GPUs among multiple concurrent jobs on the same VM, and iii) determining a due-date-aware job schedule. The effectiveness of the proposed hierarchical approach, coupled with an appropriate Mixed Integer Linear Programming formulation, is validated against first-principle methods through simulation. The experiments show that the cost efficiency of GPU-based systems can be improved by 50-70%. Finally, scalability analyses show that the proposed approach can solve problem instances with up to 100 nodes in less than one minute on average, making it suitable for practical scenarios.
2020
SYNASC 2020 Proceedings
Files in this item:
SYNASC2020-4.pdf (open access; submitted version, Post-Print: draft or Author's Accepted Manuscript (AAM); Adobe PDF, 766.77 kB)


Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1153218