Distributed optimization is widely used to solve large-scale optimization problems by parallelizing gradient-based algorithms across multiple computing nodes. In asynchronous optimization, the optimization parameter is updated using stale gradients, which… Click to show full abstract
Distributed optimization is widely used to solve large-scale optimization problems by parallelizing gradient-based algorithms across multiple computing nodes. In asynchronous optimization, the optimization parameter is updated using stale gradients, which are gradients computed with respect to outdated parameters. Although large degrees of staleness can slow convergence, little is known about the impact of staleness and its relation to other system parameters. In this work, we analyze asynchronous optimization when implemented using either hub-and-spoke or shared memory architectures. We show that the process of gradient arrival to the master node is similar in nature to a renewal process. We derive the bandwidth requirement of the system. For the hub-and-spoke setup, we derive bounds on the expected gradient staleness and show its connection to other system parameters such as the number of workers, expected compute time, and communication delays. Our derivations reveal that it is possible to adjust gradient staleness by tuning certain parameters such as minibatch size or the number of workers. For the shared memory architecture, we show that the expected staleness is equivalent to the number of workers. Our derivations can be used in existing convergence analyses to express convergence rates in terms of other known system parameters. Such an expression gives further details on what factors impact convergence.
               
Click one of the above tabs to view related content.