Informed Sampling of Prioritized Experience Replay
Ramicic M.; Bonarini A.
2022-01-01
Abstract
Experience replay plays an essential role as an information-generating mechanism in reinforcement learning systems that use neural networks as function approximators. It enables artificial learning agents to store their past experiences in a sliding-window buffer, effectively recycling them during the continual re-training of a neural network. This intermediary process of experience caching opens the possibility for an agent to optimize the order in which experiences are sampled from the buffer. Such optimization may improve on the de facto standard, stochastic prioritization based on Temporal-Difference error (TD-error), which focuses on experiences that carry more Temporal-Difference surprise for the approximator. A notion of informed prioritization is proposed: a method relying on fast on-line confidence estimates of the approximator's predictions in order to dynamically exploit the benefits of TD-error prioritization only when the prediction confidence about the selected experiences is high. The presented informed-stochastic prioritization method of replay-buffer sampling, implemented as part of the standard Deep Q-learning algorithm, outperformed vanilla stochastic prioritization based on TD-error in 41 out of 54 trialed Atari games.
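As context for the prioritization scheme the abstract improves upon, the following is a minimal sketch of standard proportional TD-error prioritization (as in prioritized experience replay): each stored transition is sampled with probability proportional to (|δ| + ε)^α, where δ is its TD-error. The function names, the blending parameter `beta_conf`, and the scalar-confidence gating used to illustrate the "informed" idea are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def proportional_priorities(td_errors, alpha=0.6, eps=1e-6):
    # p_i proportional to (|delta_i| + eps)^alpha, normalized to a distribution
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

def sample_batch(buffer, td_errors, batch_size, rng, confidence=1.0):
    # Hypothetical "informed" gating: blend TD-error prioritization with
    # uniform sampling, weighting the former by a confidence score in [0, 1].
    # confidence = 1.0 recovers plain proportional prioritization.
    n = len(buffer)
    p_td = proportional_priorities(np.asarray(td_errors, dtype=np.float64))
    p_uniform = np.full(n, 1.0 / n)
    probs = confidence * p_td + (1.0 - confidence) * p_uniform
    probs /= probs.sum()  # guard against floating-point drift
    idx = rng.choice(n, size=batch_size, p=probs)
    return [buffer[i] for i in idx], idx
```

With `confidence` near zero the sampler falls back to uniform replay, which captures, in simplified form, the idea of exploiting TD-error prioritization only when the approximator's predictions are trusted.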