Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback

Simone Drago, Marco Mussi, Alberto Maria Metelli
2025-01-01

Abstract

The success of sequential decision-making approaches, such as reinforcement learning (RL), is closely tied to the availability of reward feedback. However, designing a reward function that encodes the desired objective is a challenging task. In this work, we address a more realistic scenario: sequential decision making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking preferences, (non-Markovian) utilities, and (Markovian) rewards, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories, enabling the presence of incomparabilities that are common when preferences are provided by humans but are surprisingly overlooked in existing works. Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is NP-hard. Third, we propose a novel concept of preference-based policy dominance that does not rely on utilities or rewards and discuss the computational complexity of assessing it. Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision making from preference feedback, with promising potential applications in RL from human feedback.
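As a toy illustration of the first contribution, the sketch below shows how preference feedback over trajectories can be stored as pairwise comparisons and closed into a preorder, with incomparable pairs (which a total order cannot represent) being exactly those unrelated in either direction. This is a hedged sketch under assumptions of this record, not code from the paper; the trajectory names and preference pairs are hypothetical.

```python
# Minimal sketch (illustrative, not from the paper): preference feedback as a
# partial preorder over trajectories, allowing incomparable pairs.
from itertools import product

trajectories = ["tau1", "tau2", "tau3", "tau4"]  # hypothetical trajectory ids

# Observed preferences: (a, b) means "a is weakly preferred to b".
preferences = {("tau1", "tau2"), ("tau2", "tau3")}

def preorder_closure(pairs, elements):
    """Close the observed pairs under reflexivity and transitivity,
    the two axioms of a preorder."""
    rel = set(pairs) | {(x, x) for x in elements}
    changed = True
    while changed:
        changed = False
        # product() snapshots rel, so adding to rel inside the loop is safe;
        # the outer while-loop picks up newly derived pairs.
        for (a, b), (c, d) in product(rel, rel):
            if b == c and (a, d) not in rel:
                rel.add((a, d))
                changed = True
    return rel

rel = preorder_closure(preferences, trajectories)

def incomparable(a, b):
    """Two trajectories are incomparable when neither is preferred to the other."""
    return (a, b) not in rel and (b, a) not in rel

print(("tau1", "tau3") in rel)       # True: follows by transitivity
print(incomparable("tau1", "tau4"))  # True: no preference involves tau4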
42nd International Conference on Machine Learning, ICML 2025
Files in this record:
_ICML_2025___Camera_Ready__Preference_based_Framework (1).pdf — Adobe PDF, 439.14 kB, open access

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1292596
Citations
  • PubMed Central: ND
  • Scopus: 1
  • Web of Science: 0