
A Provably Efficient Option-Based Algorithm for both High-Level and Low-Level Learning

Gianluca Drappo; Alberto Maria Metelli; Marcello Restelli
2023-01-01

Abstract

Hierarchical Reinforcement Learning (HRL) approaches have shown successful results in solving a large variety of complex, structured, long-horizon problems. Nevertheless, a full theoretical understanding of this empirical evidence is currently missing. In the context of the option framework, previous works have devised provably efficient algorithms for the case in which the options are *fixed* and only the high-level policy selecting among them has to be learned. However, the fully realistic scenario in which both the high-level and the low-level policies are learned has been surprisingly disregarded from a theoretical perspective. This work takes a step toward understanding this latter scenario. Focusing on the finite-horizon setting, we propose a novel meta-algorithm that alternates between two regret minimization algorithms instantiated at different (high and low) levels of temporal abstraction. At the higher level, we treat the problem as a Semi-Markov Decision Process (SMDP), keeping the low-level policies fixed; at the lower level, we learn the inner option policies, keeping the high-level policy fixed. We then specialize the results for a specific choice of algorithms: we propose a novel, provably efficient regret minimizer for finite-horizon SMDPs and use a state-of-the-art regret minimizer for learning the option policies. Finally, we compare the derived bounds with those of state-of-the-art regret minimization algorithms for non-hierarchical finite-horizon problems. This comparison allows us to characterize the class of problems in which a hierarchical approach is provably preferable, even when a set of pre-trained options is not given.
2023
Sixteenth European Workshop on Reinforcement Learning
Reinforcement Learning, Hierarchical Reinforcement Learning, Options
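The abstract describes the core algorithmic idea: alternate between regret minimization over options (treating the problem as a finite-horizon SMDP with the options' inner policies frozen) and regret minimization inside each option (with the high-level policy frozen). The following Python fragment is a minimal structural sketch of that alternation, not the authors' implementation; all names (Learner, run_episode, split_by_option, the phase and episode counts) are hypothetical placeholders for the two regret minimizers and the environment interaction assumed in the paper.

```python
# Minimal structural sketch of the alternating meta-algorithm described in
# the abstract. This is NOT the authors' implementation: Learner,
# run_episode, and split_by_option are hypothetical placeholders.
from typing import Callable, Dict, List, Protocol, Tuple

Policy = Callable[[int], int]                 # maps a state to an action/option
Trajectory = List[Tuple[int, int, float]]     # (state, action or option, reward)


class Learner(Protocol):
    """Interface assumed for both regret minimizers."""
    def policy(self) -> Policy: ...
    def update(self, data: Trajectory) -> None: ...


def alternate(run_episode: Callable[[Policy, Dict[int, Policy]], Trajectory],
              split_by_option: Callable[[Trajectory], Dict[int, Trajectory]],
              high: Learner,                  # finite-horizon SMDP regret minimizer
              low: Dict[int, Learner],        # one low-level learner per option
              num_phases: int,
              episodes_per_phase: int) -> None:
    for phase in range(num_phases):
        for _ in range(episodes_per_phase):
            # One finite-horizon episode: the high-level policy chooses
            # options; each option executes its current inner policy.
            option_policies = {o: learner.policy() for o, learner in low.items()}
            traj = run_episode(high.policy(), option_policies)
            if phase % 2 == 0:
                # High-level phase: learn the policy over options on the
                # induced SMDP, keeping the inner option policies fixed.
                high.update(traj)
            else:
                # Low-level phase: keep the high-level policy fixed and
                # update each option's learner on the segments it generated.
                for option_id, segment in split_by_option(traj).items():
                    low[option_id].update(segment)
```

In the paper, the high-level learner corresponds to the proposed finite-horizon SMDP regret minimizer and each low-level learner to a state-of-the-art finite-horizon regret minimizer; the sketch only fixes the alternation structure, not those algorithms.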
Files in this item:
TwoPhase_EWRL2023_cameraready.pdf (Publisher's version, open access, Adobe PDF, 443.43 kB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11311/1249489