Distributed deep learning is widely used to train deep neural networks, especially large models on massive datasets. The Parameter Server (PS) architecture is the most popular distributed training framework, as it allows flexible design of the global parameter update scheme. However, when scaling to complex heterogeneous clusters, stragglers make it difficult for existing distributed paradigms on the PS framework to balance synchronous waiting against staleness, which sharply slows down model training. In this paper, we propose the Grouping Stale Synchronous Parallel (GSSP) scheme, which groups workers with similar performance together. Group servers coordinate intra-group workers using Stale Synchronous Parallel and communicate with one another asynchronously to eliminate stragglers and refine the model weights. We further propose Grouping Dynamic Top-K Sparsification (GDTopK), which dynamically adjusts the upload ratio for each group to differentiate communication volume and mitigate the inter-group iteration speed gap. We conduct experiments with LeNet-5 on MNIST, ResNet-18 and VGG-19 on CIFAR-10, and Seq2Seq on Multi30k. Results show that GSSP accelerates training by 46% to 120% with less than 1% accuracy drop, and GDTopK recovers part of the lost accuracy.
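As a rough illustration only (not the authors' implementation, and all function names, grouping policies, and upload ratios below are hypothetical assumptions), the following Python sketch shows the three ideas the abstract describes: grouping workers by measured iteration speed, enforcing an SSP staleness bound within a group, and sparsifying gradient uploads with a per-group top-k ratio.

import numpy as np

def group_by_speed(iter_times, num_groups):
    """Assign workers with similar per-iteration time to the same group."""
    order = np.argsort(iter_times)            # fastest -> slowest
    return np.array_split(order, num_groups)  # contiguous speed bands

def ssp_can_proceed(worker_clock, group_min_clock, staleness_bound):
    """SSP check: a worker may start its next iteration only if it is at most
    `staleness_bound` iterations ahead of the slowest worker in its group."""
    return worker_clock - group_min_clock <= staleness_bound

def topk_sparsify(grad, ratio):
    """Keep only the largest-magnitude `ratio` fraction of gradient entries."""
    k = max(1, int(ratio * grad.size))
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # indices of the top-k entries
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]
    return sparse.reshape(grad.shape)

# Toy usage: 8 workers split into 2 speed groups; here the slower group is
# given a smaller upload ratio (an assumed policy, not the paper's rule).
iter_times = np.array([1.0, 1.1, 0.9, 1.0, 2.0, 2.2, 1.9, 2.1])
groups = group_by_speed(iter_times, num_groups=2)
upload_ratio = {0: 0.10, 1: 0.05}
grad = np.random.randn(1000)
for gid, members in enumerate(groups):
    kept = topk_sparsify(grad, upload_ratio[gid]).nonzero()[0].size
    print(f"group {gid}: workers {members.tolist()}, nonzero entries uploaded {kept}")

The abstract does not specify how GDTopK adjusts the ratios over time; the fixed per-group values above are placeholders meant only to show how differentiated upload ratios reduce the communication volume of slower groups.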