Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is a challenge. Increases in training scale have made natural gradient optimization methods a reasonable alternative to SGD and its variants. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, preconditions gradients with an efficient approximation of the Fisher Information Matrix to improve per-iteration progress when optimizing an objective function. In this work, we propose a scalable K-FAC algorithm and investigate K-FAC's applicability in large-scale deep neural network training. Specifically, we explore layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling, with the goal of preserving convergence while minimizing training time. A study on the convergence and scaling properties of our K-FAC gradient preconditioner is presented using applications in the image classification, object detection, and language modeling domains. In all applications, our implementation converges to baseline performance targets in 9–25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
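
As a point of reference for the preconditioning step the abstract describes, below is a minimal NumPy sketch of a single-layer K-FAC update. It is not the paper's implementation: the function name `kfac_precondition`, the damping value, and the explicit matrix inverses are illustrative assumptions, and the inverse-free evaluation mentioned in the abstract would avoid forming these inverses directly.

```python
import numpy as np

def kfac_precondition(grad_w, a_cov, g_cov, damping=1e-3):
    """Precondition one layer's weight gradient with its Kronecker factors.

    grad_w : (out_dim, in_dim) weight gradient for the layer
    a_cov  : (in_dim, in_dim) covariance of the layer's input activations
    g_cov  : (out_dim, out_dim) covariance of the gradients w.r.t. the
             layer's pre-activation outputs

    The Kronecker product g_cov ⊗ a_cov approximates this layer's block of
    the Fisher Information Matrix, so the natural-gradient direction is
    (g_cov + λI)^{-1} @ grad_w @ (a_cov + λI)^{-1}.
    """
    in_dim = a_cov.shape[0]
    out_dim = g_cov.shape[0]
    # Damped inverses of the two factors (the step an inverse-free scheme
    # would replace, e.g. via eigendecomposition of the factors).
    a_inv = np.linalg.inv(a_cov + damping * np.eye(in_dim))
    g_inv = np.linalg.inv(g_cov + damping * np.eye(out_dim))
    # Apply the Kronecker-factored preconditioner without ever forming
    # the full (out_dim * in_dim) x (out_dim * in_dim) Fisher block.
    return g_inv @ grad_w @ a_inv
```

Because each layer's factors are independent, this per-layer step is what layer-wise distribution strategies can assign to different workers.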