Driving exploration by maximum distribution in gaussian process bandits

Nuara, A.; Trovò, F.; Crippa, D.; Gatti, N; Restelli, M.

The problem of finding optimal solutions of stochastic functions over continuous domains is common in several real-world applications, such as, e.g., advertisement allocation, dynamic pricing, and power control in wireless networks. The optimization process is customarily performed by selecting input points sequentially and receiving a noisy observation from the function. In this paper, we resort to the Multi-Armed Bandit approach, aiming at optimizing stochastic functions when keeping at a pace the regret (i.e., the loss incurred during the learning process) during the learning process. In particular, we focus on smooth stochastic functions, as it is known that any algorithm suffers from a constant per-round regret when the domain is continuous, and the function does not satisfy any kind of regularity. Our main original contribution is the provision of a general family of algorithms, which, under the mild assumption that stochastic functions are a realization of a Gaussian Process, provides a regret of the order of O(pγTT), being γT the maximum information gain and T the time horizon used for the learning process. Furthermore, we design a specific algorithm of our family, called DAGP-UCB, which exploits the structure of GPs to select the next arm to pull more effectively than the previous algorithms available in the state of the art, thus speeding up the learning process. In particular, we show the superior performance of DAGP-UCB in both synthetic and applicative settings, comparing it with the state-of-the-art algorithms.