Towards Theoretical Understanding of Sequential Decision Making with Preference Feedback
Simone Drago, Marco Mussi, Alberto Maria Metelli
2025-01-01
Abstract
The success of sequential decision-making approaches, such as reinforcement learning (RL), is closely tied to the availability of a reward feedback. However, designing a reward function that encodes the desired objective is a challenging task. In this work, we address a more realistic scenario: sequential decision making with preference feedback provided, for instance, by a human expert. We aim to build a theoretical basis linking preferences, (non-Markovian) utilities, and (Markovian) rewards, and we study the connections between them. First, we model preference feedback using a partial (pre)order over trajectories, enabling the presence of incomparabilities that are common when preferences are provided by humans but are surprisingly overlooked in existing works. Second, to provide a theoretical justification for a common practice, we investigate how a preference relation can be approximated by a multi-objective utility. We introduce a notion of preference-utility compatibility and analyze the computational complexity of this transformation, showing that constructing the minimum-dimensional utility is NP-hard. Third, we propose a novel concept of preference-based policy dominance that does not rely on utilities or rewards and discuss the computational complexity of assessing it. Fourth, we develop a computationally efficient algorithm to approximate a utility using (Markovian) rewards and quantify the error in terms of the suboptimality of the optimal policy induced by the approximating reward. This work aims to lay the foundation for a principled approach to sequential decision making from preference feedback, with promising potential applications in RL from human feedback.
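The first contribution above, a partial (pre)order over trajectories that admits incomparabilities, can be made concrete with a small illustration. The sketch below is not from the paper; it assumes one standard way a multi-objective utility can induce a partial order, namely componentwise comparison of utility vectors, and all names in it are hypothetical. Two trajectories whose vectors trade off across objectives then come out incomparable, which a scalar (total-order) utility cannot express.

```python
# Illustrative sketch (not the paper's construction): a d-dimensional utility
# induces a partial preorder over trajectories via componentwise comparison.
# Incomparable pairs arise when neither utility vector dominates the other.
from typing import Sequence

def weakly_preferred(u_a: Sequence[float], u_b: Sequence[float]) -> bool:
    """True iff utility vector u_a dominates u_b in every component."""
    return all(a >= b for a, b in zip(u_a, u_b))

def compare(u_a: Sequence[float], u_b: Sequence[float]) -> str:
    """Classify a pair of trajectories under the induced partial (pre)order."""
    a_geq_b = weakly_preferred(u_a, u_b)
    b_geq_a = weakly_preferred(u_b, u_a)
    if a_geq_b and b_geq_a:
        return "equivalent"      # identical utility vectors
    if a_geq_b:
        return "A preferred"
    if b_geq_a:
        return "B preferred"
    return "incomparable"        # the case a scalar utility cannot represent

# Two objectives: trajectory A is better on the first, B on the second,
# so neither dominates and the pair is incomparable.
print(compare([3.0, 1.0], [1.0, 3.0]))  # -> "incomparable"
print(compare([3.0, 2.0], [1.0, 1.0]))  # -> "A preferred"
```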
| File | Size | Format |
|---|---|---|
| _ICML_2025___Camera_Ready__Preference_based_Framework (1).pdf (open access) | 439.14 kB | Adobe PDF |


