A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run

Castelletti, Andrea Francesco; Pianosi, Francesca; Restelli, Marcello

doi:10.1002/wrcr.20295

The operation of large-scale water resources systems often involves several conflicting and noncommensurable objectives. Fully characterizing the tradeoffs among them is a necessary step to inform and support decisions in the absence of a unique optimal solution. In this context, the common approach is to consider many single-objective problems, resulting from different combinations of the original problem objectives, each one solved using standard optimization methods based on mathematical programming. This scalarization process is computationally very demanding, as it requires one optimization run for each tradeoff, and it often results in very sparse and poorly informative representations of the Pareto frontier. More recently, bio-inspired methods have been applied to compute an approximation of the Pareto frontier in a single run. These methods can cover the full extent of the Pareto frontier acceptably with reasonable computational effort. Yet the quality of the policy obtained may depend strongly on the algorithm tuning and preconditioning. In this paper we propose a novel multiobjective Reinforcement Learning algorithm that combines the advantages of the above two approaches and alleviates some of their drawbacks. The proposed algorithm is an extension of fitted Q-iteration (FQI) that learns, in a single training process, the operating policies for all linear combinations of preferences (weights) assigned to the objectives. The key idea of multiobjective FQI (MOFQI) is to extend the continuous approximation of the value function, which single-objective FQI performs over the state-decision space, to the weight space as well. The approach is demonstrated on a real-world case study concerning the optimal operation of the HoaBinh reservoir on the Da river, Vietnam. MOFQI is compared with the reiterated use of FQI and with a multiobjective parameterization-simulation-optimization (MOPSO) approach.
Results show that MOFQI provides a continuous approximation of the Pareto frontier with accuracy comparable to that of the reiterated use of FQI. MOFQI outperforms MOPSO when no a priori knowledge of the operating policy shape is available, while it produces slightly less accurate solutions when MOPSO can exploit such knowledge.
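
The key idea stated in the abstract, feeding the objective weights to the value-function approximator alongside the state and the decision, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy reservoir dynamics, the two reward functions, the weight grid, and the use of a Gaussian kernel smoother as the regressor (the abstract does not say which regressor the authors use). Objectives are written as rewards to maximize, that is, negative costs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy reservoir; every number here is an illustrative assumption.
S_MAX, GAMMA = 10.0, 0.9
ACTIONS = np.linspace(0.0, 4.0, 5)          # candidate release decisions

def step(s, a, inflow):
    """One transition; returns next storage and two rewards (negative costs)."""
    release = np.minimum(a, s + inflow)         # cannot release more than available
    s_next = np.clip(s + inflow - release, 0.0, S_MAX)
    r_flood = -np.maximum(s_next - 8.0, 0.0)    # objective 1: penalize high storage
    r_supply = -np.maximum(3.0 - release, 0.0)  # objective 2: penalize supply deficit
    return s_next, r_flood, r_supply

def inputs(s, a, w):
    """Regressor input: state, decision AND weight, each scaled to [0, 1]."""
    s, a, w = np.broadcast_arrays(np.atleast_1d(s), a, w)
    return np.stack([s / S_MAX, a / ACTIONS.max(), w], axis=-1).astype(float)

def smoother(Xq, Xt, bandwidth=0.15):
    """Row-normalized Gaussian kernel weights from query to training points."""
    d2 = ((Xq[:, None, :] - Xt[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return K / K.sum(axis=1, keepdims=True)

# Batch of one-step transitions collected with random decisions.
N = 200
s = rng.uniform(0.0, S_MAX, N)
a = rng.choice(ACTIONS, N)
s_next, r1, r2 = step(s, a, rng.uniform(0.0, 4.0, N))

# Enlarge the dataset: pair every transition with every weight on a grid.
W = np.linspace(0.0, 1.0, 5)
rep = lambda v: np.repeat(v, len(W))
w = np.tile(W, N)
X = inputs(rep(s), rep(a), w)
reward = w * rep(r1) + (1.0 - w) * rep(r2)      # linear scalarization

# Kernel weights for each candidate next decision, computed once.
P_next = [smoother(inputs(rep(s_next), b, w), X) for b in ACTIONS]

# Fitted Q-iteration: y holds the regression targets at the training inputs.
y = reward.copy()
for _ in range(50):
    q_next = np.stack([P @ y for P in P_next], axis=1)
    y = reward + GAMMA * q_next.max(axis=1)

def greedy_release(storage, weight):
    """Greedy decision for a given storage and preference weight."""
    q = [(smoother(inputs(storage, b, weight), X) @ y)[0] for b in ACTIONS]
    return float(ACTIONS[int(np.argmax(q))])

# One trained model, queried for several tradeoffs at the same storage.
policy = {wq: greedy_release(6.0, wq) for wq in (0.0, 0.5, 1.0)}
```

Because the weight is an ordinary input to the regressor, `greedy_release` can also be called with a weight that is not on the training grid (for example `0.3`), which is what makes the resulting approximation of the frontier continuous in the weights. A single-objective FQI baseline would instead rerun the whole loop once per weight.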