Deep Learning-Based Job Placement in Distributed Machine Learning Clusters With Heterogeneous Workloads

Nowadays, most leading IT companies host a variety of distributed machine learning (ML) workloads in ML clusters to support AI-driven services such as speech recognition, machine translation, and image processing. While multiple jobs are executed concurrently in a shared cluster to improve resource utilization, interference among co-located ML jobs can cause significant performance degradation. Existing cluster schedulers, such as YARN and Mesos, are interference-agnostic in their job placement, leading to suboptimal resource efficiency and utilization. Prior work has studied interference-aware job placement policies, but these rely on detailed workload profiling and interference modeling and therefore do not generalize. In this work, we present Harmony, a deep learning-driven ML cluster scheduler that places heterogeneous training jobs (using either the parameter server or the all-reduce architecture) in a manner that minimizes interference and maximizes performance (i.e., minimizes training completion time). Harmony is built on a carefully designed deep reinforcement learning (DRL) framework enhanced with reward modeling. The DRL framework integrates a dynamic sequence-to-sequence model with state-of-the-art techniques to stabilize training and improve convergence, including the actor-critic algorithm, job-aware action space exploration, multi-head attention, and experience replay. Because reward samples for many placement decisions are scarce, we build an auxiliary sequence-to-sequence reward prediction model, trained on historical samples, to produce rewards for unseen placements. Experiments with real ML workloads in a Kubernetes cluster of 6 GPU servers show that Harmony outperforms representative schedulers by 16%–42% in terms of average job completion time.
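
To make the described architecture concrete, below is a minimal, illustrative PyTorch sketch of an attention-based actor-critic placement policy paired with a learned reward predictor. All class names, feature dimensions, and the GRU-based RewardPredictor (standing in for the paper's sequence-to-sequence reward model) are hypothetical assumptions for illustration, not Harmony's actual implementation.

# Illustrative sketch only: an attention-based actor-critic policy plus a
# learned reward model, in the spirit of the abstract. Module names and
# dimensions are hypothetical, not Harmony's real code.
import torch
import torch.nn as nn

class PlacementPolicy(nn.Module):
    """Actor-critic network: encodes per-job features with multi-head
    self-attention, then scores candidate (job -> server) placements."""
    def __init__(self, job_feat_dim=16, num_servers=6, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(job_feat_dim, d_model)
        # Multi-head attention lets each pending job attend to co-located jobs,
        # which is how interference between jobs can be captured.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.actor = nn.Linear(d_model, num_servers)   # per-job server logits
        self.critic = nn.Linear(d_model, 1)            # state-value estimate

    def forward(self, jobs):                 # jobs: (batch, n_jobs, job_feat_dim)
        h = self.embed(jobs)
        h, _ = self.attn(h, h, h)            # jobs attend to each other
        logits = self.actor(h)               # (batch, n_jobs, num_servers)
        value = self.critic(h.mean(dim=1))   # (batch, 1)
        return logits, value

class RewardPredictor(nn.Module):
    """Auxiliary reward model (a simple GRU stand-in for the paper's
    sequence-to-sequence predictor): maps an encoded placement sequence
    to a predicted reward, usable for placements never executed."""
    def __init__(self, job_feat_dim=16, num_servers=6, d_model=64):
        super().__init__()
        self.rnn = nn.GRU(job_feat_dim + num_servers, d_model, batch_first=True)
        self.head = nn.Linear(d_model, 1)

    def forward(self, jobs, placements_onehot):
        x = torch.cat([jobs, placements_onehot], dim=-1)
        _, h = self.rnn(x)                   # h: (1, batch, d_model)
        return self.head(h[-1])              # predicted reward per batch item

# Usage: sample a placement and compute a one-step actor-critic loss,
# using the predicted reward and the critic's value as a baseline.
policy, reward_model = PlacementPolicy(), RewardPredictor()
jobs = torch.randn(1, 5, 16)                              # 5 pending jobs
logits, value = policy(jobs)
dist = torch.distributions.Categorical(logits=logits)
action = dist.sample()                                    # one server per job
onehot = nn.functional.one_hot(action, 6).float()
reward = reward_model(jobs, onehot).squeeze(-1)
advantage = (reward - value.squeeze(-1)).detach()
loss = -(dist.log_prob(action).sum(dim=1) * advantage).mean() \
       + nn.functional.mse_loss(value.squeeze(-1), reward.detach())

The key design point this sketch illustrates is the role of the auxiliary reward model: it supplies reward estimates for placement decisions that were never actually executed, so the actor-critic update is not limited to the sparse set of historically observed placements.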

Keywords: distributed machine learning; job placement; interference

Journal Title: IEEE/ACM Transactions on Networking
Year Published: 2023
