Beamer: Stage-Aware Coflow Scheduling to Accelerate Hyper-Parameter Tuning in Deep Learning Clusters

Training a neural network requires retraining the same model many times to search for the hyper-parameter configuration with the best training result. It is common to launch multiple training jobs and evaluate them in stages. At the completion of each stage, jobs with unpromising configurations are terminated and jobs with new configurations are started. Each job typically performs distributed training across multiple GPUs, and the GPUs periodically synchronize their models over the network. However, the model synchronizations of running jobs cause severe network congestion, significantly increasing the stage completion time (SCT) and thus the time needed to find the desired configuration. Existing flow schedulers are ineffective at reducing SCT because they are agnostic to training stages. In this paper, we propose a stage-aware coflow scheduling method to minimize the average SCT. The method first orders coflows using stage information and then schedules them in that order. Mathematical analysis shows that the method achieves an average SCT within a factor of 20/3 of the optimal. We implement the method in a real system called Beamer. Extensive testbed experiments and simulations show that Beamer significantly outperforms advanced network designs, such as Sincronia, FIFO-LM, and per-flow fair sharing.
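
To give a feel for why stage information can matter, the following is a minimal toy sketch and not Beamer's algorithm, which the abstract does not specify. It assumes a single bottleneck link of unit capacity, coflows sent back to back, and a hypothetical "smallest total stage size first" heuristic as the stage-aware order. The function names and the example sizes are illustrative assumptions only.

```python
from collections import defaultdict


def average_sct(order):
    """Average stage completion time for coflows sent back to back.

    `order` is a list of (stage_id, size) pairs. The link is assumed to
    have unit capacity, so a coflow of size s occupies it for s time units.
    A stage completes when its last coflow finishes.
    """
    clock = 0.0
    finish = {}
    for stage, size in order:
        clock += size
        finish[stage] = clock  # overwritten until the stage's last coflow
    return sum(finish.values()) / len(finish)


def coflow_level_order(coflows):
    """Stage-agnostic baseline: smallest coflow first, ignoring stages."""
    return sorted(coflows, key=lambda c: c[1])


def stage_aware_order(coflows):
    """Toy stage-aware heuristic (an assumption, not the paper's method):
    send all coflows of the stage with the smallest total size first."""
    totals = defaultdict(float)
    for stage, size in coflows:
        totals[stage] += size
    return sorted(coflows, key=lambda c: (totals[c[0]], c[0]))


# Hypothetical workload: stage A has three small coflows and one large one,
# stage B has two medium coflows.
coflows = [("A", 1), ("A", 1), ("A", 1), ("A", 10), ("B", 2), ("B", 2)]

agnostic = average_sct(coflow_level_order(coflows))
aware = average_sct(stage_aware_order(coflows))
print(f"stage-agnostic average SCT: {agnostic}")
print(f"stage-aware average SCT:    {aware}")
```

In this toy workload the stage-agnostic order lets the small coflows of stage A delay stage B even though stage A cannot finish until its large coflow is done, so the stage-aware order yields a lower average SCT. A real cluster has many links and concurrent flows, which this sketch deliberately ignores.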

Keywords: coflow scheduling; stage aware; network; stage; aware coflow; hyper

Journal Title: IEEE Transactions on Network and Service Management
Year Published: 2022
