Risk-averse optimization of reward-based coherent risk measures
Bisi L.; Restelli M.
2023-01-01
Abstract
In real-world problems such as robotics, finance, and healthcare, randomness is always present; it is therefore important to take risk into account in order to limit the chance of rare but dangerous events. The literature on risk-averse reinforcement learning offers many approaches to this problem, but they either struggle to scale to complex instances or exhibit irrational behaviors. Here we present two novel risk-averse objectives that are both coherent and easy to optimize: the reward-based mean-mean absolute deviation (Mean-RMAD) and the reward-based conditional value at risk (RCVaR). Rather than reducing the risk of the return, these measures reduce the risk of the per-step reward. We prove that these risk measures bound the corresponding return-based risk measures, so they can also be used as proxies for their return-based counterparts. We develop safe algorithms for these risk measures with guaranteed monotonic improvement, together with practical trust-region versions. Furthermore, we propose a decomposition of the RCVaR optimization problem into a sequence of risk-neutral problems. Finally, we conduct an empirical analysis of the proposed approaches, demonstrating their effectiveness in obtaining a variety of risk-averse behaviors on both toy problems and more challenging tasks, such as a simulated trading environment and robotic locomotion.
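As an informal illustration of the per-step objectives named above (not the paper's algorithm), the sketch below estimates a reward-based CVaR and a mean-minus-mean-absolute-deviation objective from a batch of sampled per-step rewards, using the standard empirical definitions of CVaR and mean absolute deviation; the function names and the risk-aversion parameters `alpha` and `lambda_` are illustrative assumptions, not the paper's notation.

```python
# Illustrative sketch only: empirical estimates of per-step reward risk measures
# using the standard definitions of CVaR and mean absolute deviation.
# Names and parameters (alpha, lambda_) are assumptions, not the paper's notation.
import numpy as np

def empirical_rcvar(rewards: np.ndarray, alpha: float = 0.1) -> float:
    """CVaR of the per-step reward (lower tail): mean of the worst alpha-fraction."""
    r = np.sort(rewards)                      # ascending: worst rewards first
    k = max(1, int(np.ceil(alpha * len(r))))  # size of the lower alpha-tail
    return float(r[:k].mean())

def empirical_mean_rmad(rewards: np.ndarray, lambda_: float = 0.5) -> float:
    """Mean per-step reward penalized by its mean absolute deviation."""
    mean = rewards.mean()
    mad = np.abs(rewards - mean).mean()
    return float(mean - lambda_ * mad)

# Usage: rewards collected at every step of several trajectories.
rng = np.random.default_rng(0)
per_step_rewards = rng.normal(loc=1.0, scale=0.5, size=10_000)
print("RCVaR (alpha=0.1):", empirical_rcvar(per_step_rewards, alpha=0.1))
print("Mean-RMAD (lambda=0.5):", empirical_mean_rmad(per_step_rewards, lambda_=0.5))
```

In the paper these quantities are defined on the per-step reward distribution induced by the policy and are optimized with safe and trust-region updates; the snippet above only shows how the empirical risk measures themselves could be computed from sampled rewards.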