Large-scale disasters affecting both network and datacenter (DC) infrastructures can cause severe disruptions in cloud-based services. During post-disaster recovery, repairs are usually carried out in stages in a progressive manner due to limited repair resource availability. The order in which network elements and DCs are repaired can significantly impact users' reachability to important contents/services. We investigate joint progressive network and DC recovery in which network recovery and DC recovery are conducted in a coordinated manner such that users have access to the maximum possible amount of contents/services at each repair stage. We first solve the optimization problem of joint progressive recovery to find the optimal sequence of network element and DC repairs with the objective to maximize cumulative weighted content reachability in the network. We then propose a scalable heuristic for scheduling the sequential repair of network nodes/links and DCs. Our model assumes that, at each repair stage, one network node with adjacent links and one DC can be fully repaired; however, full recovery may not be guaranteed due to limited resource availability. Hence, we also propose a 'resource-aware' approach (with two resource-allocation strategies, namely 'selective allocation' and 'adaptive allocation'), which considers both full and partial recovery of elements based on available resources at each stage. We show that, compared to disjoint progressive recovery approach, in which network recovery and DC recovery plans are independent, our joint progressive recovery approach provides significantly higher per-stage content reachability in the network.

Joint Progressive Network and Datacenter Recovery after Large-Scale Disasters

Tornatore M.;
2020-01-01

Abstract

Large-scale disasters affecting both network and datacenter (DC) infrastructures can cause severe disruptions in cloud-based services. During post-disaster recovery, repairs are usually carried out in stages in a progressive manner due to limited repair resource availability. The order in which network elements and DCs are repaired can significantly impact users' reachability to important contents/services. We investigate joint progressive network and DC recovery in which network recovery and DC recovery are conducted in a coordinated manner such that users have access to the maximum possible amount of contents/services at each repair stage. We first solve the optimization problem of joint progressive recovery to find the optimal sequence of network element and DC repairs with the objective to maximize cumulative weighted content reachability in the network. We then propose a scalable heuristic for scheduling the sequential repair of network nodes/links and DCs. Our model assumes that, at each repair stage, one network node with adjacent links and one DC can be fully repaired; however, full recovery may not be guaranteed due to limited resource availability. Hence, we also propose a 'resource-aware' approach (with two resource-allocation strategies, namely 'selective allocation' and 'adaptive allocation'), which considers both full and partial recovery of elements based on available resources at each stage. We show that, compared to disjoint progressive recovery approach, in which network recovery and DC recovery plans are independent, our joint progressive recovery approach provides significantly higher per-stage content reachability in the network.
2020
Cloud network
content reachability
progressive disaster recovery
resource allocation
File in questo prodotto:
File Dimensione Formato  
Ferdousi_TNSM_20.pdf

Accesso riservato

Descrizione: Ferdousi_TNSM_20
: Publisher’s version
Dimensione 1.99 MB
Formato Adobe PDF
1.99 MB Adobe PDF   Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1165587
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 27
  • ???jsp.display-item.citation.isi??? 13
social impact