In this work, we study the statistical properties of the off-policy estimation problem, i.e., estimating expectations under a target policy using samples collected from a different policy. We begin by presenting a novel minimax concentration lower bound that highlights the fundamental limits of off-policy estimation. We then analyze two well-known importance weighting (IW) techniques: vanilla IW and self-normalized importance weighting (SN). For both methods, we derive concentration and anti-concentration results, showing that their concentration rates are provably suboptimal compared to our lower bound. Observing that this undesired behavior arises from the heavy-tailed nature of the IW and SN estimators, we propose a new class of parametric estimators based on a transformation using the power mean (PM), which is no longer heavy-tailed. We study the theoretical properties of the PM estimator in terms of bias and variance. We show that, with suitable (possibly data-driven) tuning of its parameters, the PM estimator satisfies two key properties under certain conditions: (i) it achieves a subgaussian concentration rate that matches our lower bound and (ii) it maintains differentiability with respect to the target policy. Finally, we validate our approach through numerical simulations on both synthetic datasets and contextual bandits, comparing it against standard off-policy evaluation and learning baselines.1
Minimax off-policy evaluation and learning with subgaussian and differentiable importance weighting
Metelli A. M.;Russo A.;Restelli M.
2025-01-01
Abstract
In this work, we study the statistical properties of the off-policy estimation problem, i.e., estimating expectations under a target policy using samples collected from a different policy. We begin by presenting a novel minimax concentration lower bound that highlights the fundamental limits of off-policy estimation. We then analyze two well-known importance weighting (IW) techniques: vanilla IW and self-normalized importance weighting (SN). For both methods, we derive concentration and anti-concentration results, showing that their concentration rates are provably suboptimal compared to our lower bound. Observing that this undesired behavior arises from the heavy-tailed nature of the IW and SN estimators, we propose a new class of parametric estimators based on a transformation using the power mean (PM), which is no longer heavy-tailed. We study the theoretical properties of the PM estimator in terms of bias and variance. We show that, with suitable (possibly data-driven) tuning of its parameters, the PM estimator satisfies two key properties under certain conditions: (i) it achieves a subgaussian concentration rate that matches our lower bound and (ii) it maintains differentiability with respect to the target policy. Finally, we validate our approach through numerical simulations on both synthetic datasets and contextual bandits, comparing it against standard off-policy evaluation and learning baselines.1| File | Dimensione | Formato | |
|---|---|---|---|
|
1-s2.0-S0004370225001389-main.pdf
accesso aperto
Dimensione
3.12 MB
Formato
Adobe PDF
|
3.12 MB | Adobe PDF | Visualizza/Apri |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.


