Safe policy iteration: A monotonically improving approximate policy iteration approach

Metelli A. M.; Pirotta M.; Calandriello D.; Restelli M.
2021-01-01

Abstract

This paper presents a study of the policy improvement step that can be usefully exploited by approximate policy-iteration algorithms. When either the policy evaluation step or the policy improvement step returns an approximated result, the sequence of policies produced by policy iteration may not be monotonically increasing, and oscillations may occur. To address this issue, we consider safe policy improvements, i.e., at each iteration, we search for a policy that maximizes a lower bound on the policy improvement w.r.t. the current policy, until no improving policy can be found. We propose three safe policy-iteration schemas that differ in the way the next policy is chosen w.r.t. the estimated greedy policy. Besides being theoretically derived and discussed, the proposed algorithms are empirically evaluated and compared on chain-walk domains, the prison domain, and the Blackjack card game.
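The safe-improvement idea described above can be sketched in a few lines: instead of jumping to the estimated greedy policy, mix it with the current policy using a coefficient that maximizes a quadratic lower bound on the performance improvement. The toy MDP, the uniform state weighting, and the penalty constant below are illustrative assumptions for this sketch, not the paper's exact bound or algorithms:

```python
import numpy as np

# Minimal sketch of a conservative (safe) policy-improvement loop.
# All specifics below (the two-state MDP, uniform state weights, the
# penalty constant c) are illustrative assumptions, not the paper's bound.

gamma = 0.9
# Toy MDP: P[s, a, s'] transition probabilities, R[s, a] expected rewards.
P = np.array([[[1., 0.], [0., 1.]],
              [[0., 1.], [1., 0.]]])
R = np.array([[0., 0.],
              [1., 0.]])

def evaluate(pi):
    """Exact policy evaluation: V = (I - gamma * P_pi)^{-1} r_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected reward under pi
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

pi = np.full((2, 2), 0.5)                   # start from the uniform policy
V0 = evaluate(pi)
for _ in range(200):
    V = evaluate(pi)
    Q = R + gamma * np.einsum('sap,p->sa', P, V)
    greedy = np.eye(2)[Q.argmax(axis=1)]             # estimated greedy policy
    adv = np.einsum('sa,sa->s', greedy - pi, Q)      # advantage of greedy over pi
    A = adv.mean()                                   # expected advantage (uniform weights)
    dist = np.abs(greedy - pi).sum(axis=1).max()     # max per-state policy distance
    c = gamma * dist ** 2 / (2 * (1 - gamma) ** 2)   # illustrative penalty constant
    if A <= 1e-12 or c == 0:
        break                                        # no provably improving step left
    alpha = min(1.0, A / (2 * c))                    # maximizes alpha*A - alpha^2*c
    pi = alpha * greedy + (1 - alpha) * pi           # safe mixture update
```

Because each update only mixes toward a policy that is greedy w.r.t. the current Q-function, the policy improvement theorem guarantees the value never decreases at any state, which is the monotonicity property that a full greedy jump loses under approximation.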
Keywords: Approximate Dynamic Programming; Approximate Policy Iteration; Markov Decision Process; Policy Chattering; Policy Oscillation; Reinforcement Learning
Files in this product:

19-707.pdf — open access — Publisher's version — 1.1 MB, Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1177647
Citations
  • PMC: n/a
  • Scopus: 8
  • Web of Science: 3