Scaling deep neural network training to more processors and larger batch sizes is key to reducing end-to-end training time; yet, maintaining comparable convergence and hardware utilization at larger scales is a challenge. Increases in training scale have made natural gradient optimization methods a reasonable alternative to SGD and its variants. Kronecker-factored Approximate Curvature (K-FAC), a natural gradient method, preconditions gradients with an efficient approximation of the Fisher Information Matrix to improve per-iteration progress when optimizing an objective function. In this work, we propose a scalable K-FAC algorithm and investigate K-FAC's applicability in large-scale deep neural network training. Specifically, we explore layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling, with the goal of preserving convergence while minimizing training time. A study on the convergence and scaling properties of our K-FAC gradient preconditioner is presented using applications in the image classification, object detection, and language modeling domains. In all applications, our implementation converges to baseline performance targets in 9–25% less time than the standard first-order optimizers on GPU clusters across a variety of scales.
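
As a point of reference for the preconditioning step the abstract describes, below is a minimal NumPy sketch of a single-layer K-FAC update. It is not the paper's implementation: the function name `kfac_precondition`, the damping value, and the explicit matrix inverses are illustrative assumptions, and the inverse-free evaluation mentioned in the abstract would avoid forming these inverses directly.

```python
import numpy as np

def kfac_precondition(grad_w, a_cov, g_cov, damping=1e-3):
    """Precondition one layer's weight gradient with its Kronecker factors.

    grad_w : (out_dim, in_dim) weight gradient for the layer
    a_cov  : (in_dim, in_dim) covariance of the layer's input activations
    g_cov  : (out_dim, out_dim) covariance of the gradients w.r.t. the
             layer's pre-activation outputs

    The Kronecker product g_cov ⊗ a_cov approximates this layer's block of
    the Fisher Information Matrix, so the natural-gradient direction is
    (g_cov + λI)^{-1} @ grad_w @ (a_cov + λI)^{-1}.
    """
    in_dim = a_cov.shape[0]
    out_dim = g_cov.shape[0]
    # Damped inverses of the two factors (the step an inverse-free scheme
    # would replace, e.g. via eigendecomposition of the factors).
    a_inv = np.linalg.inv(a_cov + damping * np.eye(in_dim))
    g_inv = np.linalg.inv(g_cov + damping * np.eye(out_dim))
    # Apply the Kronecker-factored preconditioner without ever forming
    # the full (out_dim * in_dim) x (out_dim * in_dim) Fisher block.
    return g_inv @ grad_w @ a_inv
```

Because each layer's factors are independent, this per-layer step is what layer-wise distribution strategies can assign to different workers.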