In this paper, we study the problem of universal best arm identification in multi-armed bandits, where the underlying setting can be either stochastic or adversarial and is not revealed to the forecaster a priori. We propose an algorithm, called S3-BA, that identifies the best arm without prior knowledge of the underlying setting. The key idea is to explore the arms while simultaneously learning the problem setting, and to switch strategies once the adversarial setting is declared. We analyze the performance of the proposed algorithm for both stochastic and adversarial bandits. For stochastic bandits, we prove an upper bound on the error rate of S3-BA. For adversarial bandits, we show that identification of the best arm is guaranteed as long as the budget exceeds a threshold. Both results are comparable to those of the best known algorithms specifically designed for the corresponding settings. More importantly, we prove that the “learning loss” due to the formulation uncertainty is negligible compared to the cost of the arm selection policies, for both stochastic and adversarial bandits. We also propose an adaptive version of S3-BA, called Ad-S3-BA, which facilitates parameter tuning via simultaneous gap estimation. Experimental results on both synthetic and real-world datasets corroborate the theoretical analysis of S3-BA and, in particular, demonstrate its superior performance on real-world datasets that exhibit a “pseudo-stochastic” behavior.
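The abstract does not spell out the algorithm's internals, so as an illustration only, the following is a minimal sketch of the "explore while detecting the setting, then switch" idea. All names, the uniform-exploration phase, the drift-based stochasticity test, and the `threshold` parameter are hypothetical assumptions for exposition, not the paper's actual S3-BA procedure.

```python
import numpy as np

def s3ba_sketch(pull, n_arms, budget, threshold=0.1, rng=None):
    """Illustrative sketch (NOT the paper's S3-BA): explore the arms
    uniformly while testing whether rewards look stochastic; switch to
    an adversarial-style sampling rule once the test fires.

    pull(arm) -> reward in [0, 1]; `threshold` is a hypothetical
    drift-detection parameter."""
    rng = rng or np.random.default_rng()
    rewards = [[] for _ in range(n_arms)]
    adversarial = False
    for t in range(budget):
        if not adversarial:
            arm = t % n_arms                  # round-robin exploration
        else:
            # hypothetical stand-in for an adversarial BAI policy:
            # sample an arm uniformly at random
            arm = int(rng.integers(n_arms))
        rewards[arm].append(pull(arm))
        # hypothetical stochasticity test: if any arm's first-half and
        # second-half empirical means drift apart, declare adversarial
        if not adversarial and all(len(r) >= 4 for r in rewards):
            for r in rewards:
                h = len(r) // 2
                if abs(np.mean(r[:h]) - np.mean(r[h:])) > threshold:
                    adversarial = True
                    break
    # recommend the arm with the highest empirical mean
    return max(range(n_arms), key=lambda a: np.mean(rewards[a]))
```

For example, `s3ba_sketch(lambda a: np.random.binomial(1, [0.3, 0.5, 0.8][a]), n_arms=3, budget=300)` would typically return arm 2 in this stochastic instance; under drifting rewards the test would trip and the uniform sampling rule would take over.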