Adapting bandit algorithms for settings with sequentially available arms

Gabrielli, Marco; Antonelli, Manuela; Trovo, Francesco

doi:10.1016/j.engappai.2023.107815

Many real-world applications involve a sequential decision-making process in which the options are presented simultaneously. However, in other applications, such as Internet campaign management and environmental monitoring, the available options are presented sequentially to the decision-maker, who, at each time, is asked whether or not to select the proposed option. This scenario is defined as the Sequential Pull/No-Pull setting. The present study aims at developing a meta-algorithm, namely Sequential Pull/No-pull for MAB (Seq), to adapt any classical MAB (Multi-Armed Bandit) policy to this setting, for both regret minimization (RM) and best-arm identification (BAI) problems. This is achieved by exploiting the sequential nature of these settings, which allows the decision-maker to select multiple arms and gather more information than classical policies do. The proposed Seq meta-algorithm provides the same theoretical guarantees as the MAB policy employed, and was shown to outperform several classical MAB policies in RM and BAI problems employing real-world data. In particular, in the RM scenario regarding Internet advertising optimization, Seq-adapted algorithms achieved, on average, ≈10% lower regret over the whole time horizon than classical MAB policies. When tested in a BAI problem involving the identification of the time of day characterized by the highest concentration of pollutants in a water monitoring scenario, Seq identified the correct time in less than 4 days and 28 measurements.
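To make the Sequential Pull/No-Pull setting concrete, the following is a minimal simulation sketch and not the Seq meta-algorithm itself, whose decision rule the abstract does not specify. It assumes Bernoulli rewards, a fixed presentation order, and a UCB1-style index; the function name `seq_pull_no_pull_ucb` and the pull rule (pull an arm when its upper confidence bound is at least the best empirical mean seen so far) are hypothetical choices made for illustration only.

```python
import math
import random


def seq_pull_no_pull_ucb(means, horizon, seed=0):
    """Simulate a sequential pull/no-pull loop with a UCB1-style index.

    Hypothetical illustration: the pull rule below is an assumption,
    not the rule defined in the paper.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # cumulative reward per arm

    for _ in range(horizon):
        # Arms are offered one at a time, in a fixed order (assumption).
        for arm in range(n_arms):
            if counts[arm] == 0:
                pull = True  # no data yet: try the arm once
            else:
                total = sum(counts)
                mean = sums[arm] / counts[arm]
                bonus = math.sqrt(2.0 * math.log(total) / counts[arm])
                best_mean = max(
                    sums[a] / counts[a] for a in range(n_arms) if counts[a] > 0
                )
                # Assumed rule: pull if the arm could still plausibly be best.
                pull = mean + bonus >= best_mean
            if pull:
                # Bernoulli reward with the arm's (unknown to the learner) mean.
                reward = 1.0 if rng.random() < means[arm] else 0.0
                counts[arm] += 1
                sums[arm] += reward
    return counts


if __name__ == "__main__":
    print(seq_pull_no_pull_ucb([0.1, 0.5, 0.9], horizon=2000))
```

Under this assumed rule, several arms can be pulled within a single round, which is the feature of the setting that the abstract says Seq exploits to gather more information than a one-pull-per-round policy.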