Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems in the past decades. However, non-stationary settings-very common in real-world applications-received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for non-stationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing. In the former, the reward distributions are constant during sequences of rounds, and their change may be arbitrary and happen at unknown rounds, while, in the latter, the reward distributions smoothly evolve over rounds according to unknown dynamics. Under mild assumptions, we provide regret upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both the non-stationarity forms are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the forms of non-stationarity are taken separately, as previously studied in the literature.

Sliding-Window Thompson Sampling for Non-Stationary Settings

Trovo, Francesco;Paladino, Stefano;Restelli, Marcello;Gatti, Nicola
2020-01-01

Abstract

Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems in the past decades. However, non-stationary settings-very common in real-world applications-received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for non-stationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing. In the former, the reward distributions are constant during sequences of rounds, and their change may be arbitrary and happen at unknown rounds, while, in the latter, the reward distributions smoothly evolve over rounds according to unknown dynamics. Under mild assumptions, we provide regret upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both the non-stationarity forms are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the forms of non-stationarity are taken separately, as previously studied in the literature.
File in questo prodotto:
File Dimensione Formato  
allegatoF1_trovo2020sliding.pdf

accesso aperto

Descrizione: Articolo principale
: Publisher’s version
Dimensione 809.17 kB
Formato Adobe PDF
809.17 kB Adobe PDF Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11311/1143119
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 37
  • ???jsp.display-item.citation.isi??? 14
social impact