Trading-off Reward Maximization and Stability in Sequential Decision Making

Federico Corso, Marco Mussi, Alberto Maria Metelli
2025-01-01

Abstract

Reinforcement Learning (RL) focuses on learning policies that maximize the expected reward. This simple objective has enabled the success of RL in a wide range of scenarios. However, as emphasized by control-theoretic methods, stability is also a desired property when dealing with real-world systems. In this paper, we take a first step toward incorporating the notion of stability into RL. We focus on planning in ergodic Markov Decision Processes (MDPs), i.e., those that converge to a unique stationary distribution under any policy. We define the notion of stability in this context as the speed at which the induced Markov Chain (MC) converges to its stationary distribution. Noting that this property is connected to the spectral characteristics of the induced MC, we study the challenges of including a stability-related term in the RL objective function. First, we highlight how naive approaches to trading off between reward maximization and stability lead to bilinear optimization programs, which are computationally demanding. Second, we propose an approach that bypasses this issue through a novel formulation and a surrogate objective function.
Eighteenth European Workshop on Reinforcement Learning, 2025
Keywords: Reinforcement Learning, Markov Decision Processes, Stability
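
The abstract ties stability to the spectral characteristics of the induced Markov chain: for an ergodic chain, the distance to the stationary distribution decays at a rate governed by the second-largest eigenvalue modulus (SLEM), so the spectral gap 1 - SLEM measures mixing speed. The following minimal Python sketch illustrates these quantities on a tabular MDP; the helper names, the scalarized objective, and the trade-off weight `lam` are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def induced_chain(P, pi):
    """Chain induced by a stationary policy: P_pi[s, s'] = sum_a pi[s, a] * P[s, a, s']."""
    return np.einsum("sa,sat->st", pi, P)

def stationary_distribution(P_pi):
    """Unique stationary distribution of an ergodic chain: mu @ P_pi = mu."""
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return mu / mu.sum()

def slem(P_pi):
    """Second-largest eigenvalue modulus; 1 - SLEM is the spectral gap that
    governs how fast the chain converges to its stationary distribution."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P_pi)))[::-1]
    return moduli[1]

def scalarized_objective(P, r, pi, lam=0.1):
    """Hypothetical trade-off (illustrative, not the paper's formulation):
    stationary average reward minus lam * SLEM, so higher values mean both
    high reward and fast mixing (stability)."""
    P_pi = induced_chain(P, pi)
    mu = stationary_distribution(P_pi)
    avg_reward = mu @ np.einsum("sa,sa->s", pi, r)
    return avg_reward - lam * slem(P_pi)

# Toy 2-state, 2-action ergodic MDP: P[s, a, s'] transitions, r[s, a] rewards.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # uniform policy
print(scalarized_objective(P, r, pi))
```

Evaluating such a scalarized objective for a fixed policy is straightforward; optimizing it over policies is not, since the spectral term depends nonlinearly on the policy, which is consistent with the bilinear-program difficulty the abstract highlights.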
Files in this record:
165_Trading_off_Reward_Maximiz.pdf (Publisher's version, open access, Adobe PDF, 401.54 kB)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1298045