RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

n the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent’s control. To address these realistic scenarios, we introduce a novel paradigm called Sleeping Reinforcement Learning, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order Õ(H√SAT), where H is the horizon, S and A are the cardinalities of the state and action spaces, respectively, and T is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order Ω(√T2A/2) when the availability of actions is governed by a Markovian process, establishing a statistical barrier of the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order Õ(H√SAT), showing that the problem shares the same complexity of standard RL.

Sleeping Reinforcement Learning

Simone Drago;Marco Mussi;Alberto Maria Metelli

2025-01-01

Abstract

n the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent’s control. To address these realistic scenarios, we introduce a novel paradigm called Sleeping Reinforcement Learning, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order Õ(H√SAT), where H is the horizon, S and A are the cardinalities of the state and action spaces, respectively, and T is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order Ω(√T2A/2) when the availability of actions is governed by a Markovian process, establishing a statistical barrier of the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order Õ(H√SAT), showing that the problem shares the same complexity of standard RL.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2025
			
	Titolo del libro
	
				42nd International Conference on Machine Learning, ICML 2025
			
	Titolo della collana
	
				PROCEEDINGS OF MACHINE LEARNING RESEARCH
			
	Appare nelle tipologie:
	
				04.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
_ICML_2025___Camera_Ready__Sleeping_Reinforcement_Learning (1).pdf accesso aperto Dimensione 742.86 kB Formato Adobe PDF Visualizza/Apri	742.86 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1292606

Citazioni

ND

0

ND

ND

social impact