RE.PUBLIC@POLIMI Research Publications of Politecnico di Milano

Minimax off-policy evaluation and learning with subgaussian and differentiable importance weighting

Metelli A. M.; Russo A.; Restelli M.

2025-01-01

Abstract

In this work, we study the statistical properties of the off-policy estimation problem, i.e., estimating expectations under a target policy using samples collected from a different policy. We begin by presenting a novel minimax concentration lower bound that highlights the fundamental limits of off-policy estimation. We then analyze two well-known importance weighting (IW) techniques: vanilla IW and self-normalized importance weighting (SN). For both methods, we derive concentration and anti-concentration results, showing that their concentration rates are provably suboptimal compared to our lower bound. Observing that this undesired behavior arises from the heavy-tailed nature of the IW and SN estimators, we propose a new class of parametric estimators based on a power mean (PM) transformation, whose resulting estimators are no longer heavy-tailed. We study the theoretical properties of the PM estimator in terms of bias and variance. We show that, with suitable (possibly data-driven) tuning of its parameters, the PM estimator satisfies two key properties under certain conditions: (i) it achieves a subgaussian concentration rate that matches our lower bound and (ii) it remains differentiable with respect to the target policy. Finally, we validate our approach through numerical simulations on both synthetic datasets and contextual bandits, comparing it against standard off-policy evaluation and learning baselines.
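The three estimators named in the abstract can be illustrated with a minimal Python sketch. The vanilla IW and SN estimators follow their standard textbook definitions. The power-mean correction shown here, a weighted power mean of exponent `s` between each importance weight and 1 with mixing parameter `lam`, is an assumed form chosen for illustration; this record does not state the exact formula, and all function and parameter names are hypothetical.

```python
import numpy as np


def iw_estimate(f, w):
    """Vanilla importance weighting: sample mean of w_i * f_i."""
    f, w = np.asarray(f, dtype=float), np.asarray(w, dtype=float)
    return float(np.mean(w * f))


def sn_estimate(f, w):
    """Self-normalized importance weighting: weights rescaled to sum to 1."""
    f, w = np.asarray(f, dtype=float), np.asarray(w, dtype=float)
    return float(np.sum(w * f) / np.sum(w))


def pm_weights(w, lam, s=-1.0):
    """Assumed power-mean correction: ((1 - lam) * w**s + lam) ** (1 / s).

    With s = -1 this is a weighted harmonic mean of w and 1, which keeps
    every corrected weight at or below 1 / lam for lam > 0.
    Requires strictly positive weights and s != 0.
    """
    w = np.asarray(w, dtype=float)
    return ((1.0 - lam) * w**s + lam) ** (1.0 / s)


def pm_estimate(f, w, lam, s=-1.0):
    """Sample mean of f_i times the power-mean-corrected weights."""
    f = np.asarray(f, dtype=float)
    return float(np.mean(pm_weights(w, lam, s) * f))


if __name__ == "__main__":
    # Toy setting (an assumption): behavior policy N(0, 1), target N(0.5, 1).
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 1.0, size=1000)
    w = np.exp(0.5 * x - 0.125)  # density ratio N(0.5, 1) / N(0, 1)
    f = x                        # estimate the target mean, which is 0.5
    print("IW:", iw_estimate(f, w))
    print("SN:", sn_estimate(f, w))
    print("PM:", pm_estimate(f, w, lam=0.05))
```

In this sketch, `lam = 0` recovers the vanilla IW weights and `lam = 1` sets every weight to 1, so `lam` trades bias against tail heaviness. The corrected weights are a smooth function of the original weights, in contrast to hard truncation.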

Year of publication: 2025

Journal title: ARTIFICIAL INTELLIGENCE

Keywords: Off-policy estimation; Importance weighting; Power mean transformation; Subgaussian concentration; Differentiable importance weighting

Appears in types: 01.1 Journal Article

Files in this item:

File	Size	Format
1-s2.0-S0004370225001389-main.pdf (open access)	3.12 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11311/1310548