
Minimax off-policy evaluation and learning with subgaussian and differentiable importance weighting

Metelli A. M.; Russo A.; Restelli M.
2025-01-01

Abstract

In this work, we study the statistical properties of the off-policy estimation problem, i.e., estimating expectations under a target policy using samples collected from a different policy. We begin by presenting a novel minimax concentration lower bound that highlights the fundamental limits of off-policy estimation. We then analyze two well-known importance weighting (IW) techniques: vanilla IW and self-normalized importance weighting (SN). For both methods, we derive concentration and anti-concentration results, showing that their concentration rates are provably suboptimal compared to our lower bound. Observing that this undesired behavior arises from the heavy-tailed nature of the IW and SN estimators, we propose a new class of parametric estimators based on a transformation using the power mean (PM), which is no longer heavy-tailed. We study the theoretical properties of the PM estimator in terms of bias and variance. We show that, with suitable (possibly data-driven) tuning of its parameters, the PM estimator satisfies two key properties under certain conditions: (i) it achieves a subgaussian concentration rate that matches our lower bound and (ii) it maintains differentiability with respect to the target policy. Finally, we validate our approach through numerical simulations on both synthetic datasets and contextual bandits, comparing it against standard off-policy evaluation and learning baselines.
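As a rough illustration of the estimators the abstract discusses, the following Python sketch compares vanilla importance weighting, self-normalized importance weighting, and a power-mean-style weight transformation on a synthetic Gaussian problem. The specific transform `((1 - lam) + lam * w**s) ** (1/s)` and its parameters `lam`, `s` are an assumed parametrization for illustration only; the paper defines the exact PM estimator and how its parameters are tuned (possibly in a data-driven way).

```python
import math
import random

def power_mean_weight(w, lam=0.5, s=-1.0):
    """Assumed power-mean transform of an importance weight (illustrative):
    ((1 - lam) * 1**s + lam * w**s) ** (1/s).
    For s < 0 the transformed weight is bounded by 1 / (1 - lam),
    which tames the heavy tail of the raw weights."""
    return ((1.0 - lam) + lam * w ** s) ** (1.0 / s)

def estimates(payoffs, weights):
    """Return (vanilla IW, self-normalized IW, power-mean sketch) estimates."""
    n = len(payoffs)
    wf = sum(w * f for w, f in zip(weights, payoffs))
    iw = wf / n                       # vanilla IW: unbiased, heavy-tailed
    sn = wf / sum(weights)            # SN: normalizes by the weight sum
    pm_weights = [power_mean_weight(w) for w in weights]
    pm = sum(w * f for w, f in zip(pm_weights, payoffs)) / n  # biased, bounded
    return iw, sn, pm

# Synthetic example: behavior policy N(0, 1), target policy N(1, 1),
# payoff f(x) = x, so the target expectation is 1.
random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(10000)]

def density_ratio(x):
    # target density / behavior density for the two Gaussians above
    return math.exp(-0.5 * ((x - 1.0) ** 2 - x ** 2))

ws = [density_ratio(x) for x in xs]
iw, sn, pm = estimates(xs, ws)
```

With `lam = 0.5` and `s = -1`, every transformed weight is at most 2, so the PM-style estimate trades a small bias for bounded weights, which is the mechanism behind the subgaussian concentration claimed in the abstract.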
Off-policy estimation
Importance weighting
Power mean transformation
Subgaussian concentration
Differentiable importance weighting
Files in this item:
1-s2.0-S0004370225001389-main.pdf (open access, 3.12 MB, Adobe PDF)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1310548
Citations
  • Scopus: 0
  • Web of Science: 0