RE.PUBLIC@POLIMI pubblicazioni di ricerca del Politecnico di Milano

Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, making them sample-inefficient and requiring $\mathcal{O}(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $\mathcal{O}(\epsilon^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $\win$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-2}\win^{-1})$. When reusing all available past trajectories, this leads to a rate of $\widetilde{\mathcal{O}}(\epsilon^{-1})$, the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.

Reusing Trajectories in Policy Gradients Enables Fast Convergence

A. Montenegro;F. Mansutti;M. Mussi;M. Papini;A. M. Metelli

2026-01-01

Abstract

Policy gradient (PG) methods are a class of effective reinforcement learning algorithms, particularly when dealing with continuous control problems. They rely on fresh on-policy data, making them sample-inefficient and requiring $\mathcal{O}(\epsilon^{-2})$ trajectories to reach an $\epsilon$-approximate stationary point. A common strategy to improve efficiency is to reuse information from past iterations, such as previous gradients or trajectories, leading to off-policy PG methods. While gradient reuse has received substantial attention, leading to improved rates up to $\mathcal{O}(\epsilon^{-3/2})$, the reuse of past trajectories, although intuitive, remains largely unexplored from a theoretical perspective. In this work, we provide the first rigorous theoretical evidence that reusing past off-policy trajectories can significantly accelerate PG convergence. We propose RT-PG (Reusing Trajectories - Policy Gradient), a novel algorithm that leverages a power mean-corrected multiple importance weighting estimator to effectively combine on-policy and off-policy data coming from the most recent $\win$ iterations. Through a novel analysis, we prove that RT-PG achieves a sample complexity of $\widetilde{\mathcal{O}}(\epsilon^{-2}\win^{-1})$. When reusing all available past trajectories, this leads to a rate of $\widetilde{\mathcal{O}}(\epsilon^{-1})$, the best known one in the literature for PG methods. We further validate our approach empirically, demonstrating its effectiveness against baselines with state-of-the-art rates.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno di pubblicazione
	
				2026
			
	Titolo del libro
	
				43rd International Conference on Machine Learning, ICML 2026
			
	Titolo della collana
	
				PROCEEDINGS OF MACHINE LEARNING RESEARCH
			
	Appare nelle tipologie:
	
				04.1 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
_ICML_2026__Trajectory_Reuse_in_PGs (2).pdf accesso aperto : Post-Print (DRAFT o Author’s Accepted Manuscript-AAM) Dimensione 1.78 MB Formato Adobe PDF Visualizza/Apri	1.78 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1317330

Citazioni

ND

ND

ND

ND

social impact