Despite the success achieved by the analysis of supervised learning algorithms in the framework of statistical mechanics, reinforcement learning has remained largely untouched by physicists. Here we move towards closing the gap by analyzing the dynamics of the policy gradient algorithm. For a convex problem, namely the k-armed bandit, we show that the learning dynamics obey a drift-diffusion motion described by a Langevin equation, the coefficients of which can be tuned by the learning rate. We explore the striking similarity between our Langevin equation and the Kimura equation, which describes the evolution of genotypes. Furthermore, we propose a mapping between a nonconvex reinforcement learning setting describing multiple joints of a robotic arm and a disordered system, namely a p-spin glass. This mapping enables us to show how the learning rate acts as an effective temperature and is thus capable of smoothing rough landscapes, corroborating what is displayed by the drift-diffusive description and paving the way for physics-inspired algorithmic optimization based on annealing procedures in disordered systems.
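To make the k-armed bandit setting concrete, the following is a minimal sketch of a policy gradient (REINFORCE-style) update on a bandit. It is not the paper's implementation: the softmax parameterization, the unit-variance Gaussian reward noise, and the names `run_bandit`, `mean_rewards` and `learning_rate` are all assumptions made for illustration. Each step is a noisy gradient estimate scaled by the learning rate, which is the kind of stochastic update that a drift-diffusion description would apply to.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over the policy parameters."""
    z = theta - theta.max()
    e = np.exp(z)
    return e / e.sum()


def run_bandit(mean_rewards, learning_rate, n_steps, seed=0):
    """Run a REINFORCE-style policy gradient on a k-armed bandit.

    Assumptions (not taken from the source): softmax policy,
    Gaussian reward noise with unit variance, no baseline.
    Returns an array of shape (n_steps, k) holding the policy
    probabilities after each update.
    """
    rng = np.random.default_rng(seed)
    mean_rewards = np.asarray(mean_rewards, dtype=float)
    k = mean_rewards.size
    theta = np.zeros(k)  # start from the uniform policy
    history = np.empty((n_steps, k))

    for t in range(n_steps):
        pi = softmax(theta)
        arm = rng.choice(k, p=pi)  # sample an action from the policy
        reward = mean_rewards[arm] + rng.normal(0.0, 1.0)

        # Gradient of log pi(arm) w.r.t. theta for a softmax policy:
        # one-hot(arm) - pi
        grad_log_pi = -pi
        grad_log_pi[arm] += 1.0

        # Stochastic gradient ascent step; the learning rate scales
        # both the mean update and its fluctuations.
        theta += learning_rate * reward * grad_log_pi
        history[t] = softmax(theta)

    return history


if __name__ == "__main__":
    # Hypothetical reward means, chosen only for illustration.
    for lr in (0.01, 0.1):
        probs = run_bandit([0.2, 0.5, 1.0], learning_rate=lr, n_steps=2000)
        print(f"learning_rate={lr}: final policy = {probs[-1]}")
```

Running the script at several learning rates and comparing the trajectories in `history` gives a rough, qualitative feel for how the step size affects the size of the fluctuations around the average trend.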