n the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent’s control. To address these realistic scenarios, we introduce a novel paradigm called Sleeping Reinforcement Learning, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order Õ(H√SAT), where H is the horizon, S and A are the cardinalities of the state and action spaces, respectively, and T is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order Ω(√T2A/2) when the availability of actions is governed by a Markovian process, establishing a statistical barrier of the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order Õ(H√SAT), showing that the problem shares the same complexity of standard RL.

Sleeping Reinforcement Learning

Simone Drago;Marco Mussi;Alberto Maria Metelli
2025-01-01

Abstract

n the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent’s control. To address these realistic scenarios, we introduce a novel paradigm called Sleeping Reinforcement Learning, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order Õ(H√SAT), where H is the horizon, S and A are the cardinalities of the state and action spaces, respectively, and T is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order Ω(√T2A/2) when the availability of actions is governed by a Markovian process, establishing a statistical barrier of the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order Õ(H√SAT), showing that the problem shares the same complexity of standard RL.
2025
42nd International Conference on Machine Learning, ICML 2025
File in questo prodotto:
File Dimensione Formato  
_ICML_2025___Camera_Ready__Sleeping_Reinforcement_Learning (1).pdf

accesso aperto

Dimensione 742.86 kB
Formato Adobe PDF
742.86 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1292606
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact