Distributed deep learning is widely used to train deep neural networks, especially large models on massive datasets. The Parameter Server (PS) architecture is the most popular distributed training framework, as it allows flexible design of the global parameter update scheme. However, when scaling to complex heterogeneous clusters, stragglers make it difficult for existing distributed paradigms on the PS framework to balance synchronous waiting against staleness, which sharply slows down model training. In this paper, we propose the Grouping Stale Synchronous Parallel (GSSP) scheme, which groups workers with similar performance together. Group servers coordinate intra-group workers using Stale Synchronous Parallel and communicate with one another asynchronously to eliminate stragglers and refine the model weights. We further propose Grouping Dynamic Top-K Sparsification (GDTopK), which dynamically adjusts the upload ratio for each group to differentiate communication volume and mitigate the inter-group iteration speed gap. We conduct experiments with LeNet-5 on MNIST, ResNet-18 and VGG-19 on CIFAR-10, and Seq2Seq on Multi30k. Results show that GSSP accelerates training by 46% to 120% with less than 1% accuracy drop, and GDTopK recovers part of the lost accuracy.
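As a rough illustration only (not the authors' implementation, and all function names, grouping policies, and upload ratios below are hypothetical assumptions), the following Python sketch shows the three ideas the abstract describes: grouping workers by measured iteration speed, enforcing an SSP staleness bound within a group, and sparsifying gradient uploads with a per-group top-k ratio.

import numpy as np

def group_by_speed(iter_times, num_groups):
    """Assign workers with similar per-iteration time to the same group."""
    order = np.argsort(iter_times)            # fastest -> slowest
    return np.array_split(order, num_groups)  # contiguous speed bands

def ssp_can_proceed(worker_clock, group_min_clock, staleness_bound):
    """SSP check: a worker may start its next iteration only if it is at most
    `staleness_bound` iterations ahead of the slowest worker in its group."""
    return worker_clock - group_min_clock <= staleness_bound

def topk_sparsify(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    k = max(1, int(ratio * grad.size))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

# Toy usage: 8 workers split into 2 speed groups; here the slower group is
# given a smaller upload ratio (an assumed policy, not the paper's rule).
iter_times = np.array([1.0, 1.1, 0.9, 1.0, 2.0, 2.2, 1.9, 2.1])
groups = group_by_speed(iter_times, num_groups=2)
upload_ratio = {0: 0.10, 1: 0.05}
grad = np.random.randn(1000)
for gid, members in enumerate(groups):
    kept = topk_sparsify(grad, upload_ratio[gid]).nonzero()[0].size
    print(f"group {gid}: workers {members.tolist()}, nonzero entries uploaded {kept}")

The abstract does not specify how GDTopK adjusts the ratios over time; the fixed per-group values above are placeholders meant only to show how differentiated upload ratios reduce the communication volume of slower groups.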