Optimizing Distributed DNN Training Using CPUs and BlueField-2 DPUs

The deep learning (DL) training process consists of multiple phases—data augmentation, training, and validation of the trained model. Traditionally, these phases are executed serially on the central processing units or graphics processing units, due to a lack of additional computing resources to which independent phases of DL training could be offloaded. Recently, Mellanox/NVIDIA introduced the BlueField-2 data processing units (DPUs), which combine the advanced capabilities of traditional application-specific integrated circuit (ASIC)-based network adapters with an array of ARM processors. In this article, we explore how to take advantage of the additional ARM cores on the BlueField-2 DPUs. We propose and evaluate multiple novel designs to efficiently offload the phases of DL training to the DPUs. Our experimental results show that the proposed designs are able to deliver up to 17.5% improvement in overall DL training time. To the best of our knowledge, this is the first work to explore the use of DPUs to accelerate DL training.
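The core idea in the abstract—running data augmentation on auxiliary cores concurrently with training instead of serially—can be illustrated with a minimal producer/consumer sketch. This is not the paper's implementation; the function names (`augment`, `train_step`, `run_pipeline`) and the toy workloads are hypothetical stand-ins, and an ordinary background thread stands in for the DPU's ARM cores, which would run the producer side in the proposed designs.

```python
import queue
import threading

def augment(sample):
    # Stand-in augmentation: scale each value. A real pipeline would
    # decode, crop, flip, and normalize input images here.
    return [x * 2 for x in sample]

def producer(samples, q):
    # Runs concurrently with training; in the offloaded designs this
    # role would be played by the DPU's ARM processor array.
    for s in samples:
        q.put(augment(s))
    q.put(None)  # sentinel: no more batches

def train_step(batch):
    # Stand-in training step: reduce the batch to a single number.
    return sum(batch)

def run_pipeline(samples):
    q = queue.Queue(maxsize=4)  # bounded queue gives backpressure
    t = threading.Thread(target=producer, args=(samples, q))
    t.start()
    results = []
    while (batch := q.get()) is not None:
        # Training consumes augmented batches as they become ready,
        # overlapping augmentation with compute instead of serializing.
        results.append(train_step(batch))
    t.join()
    return results

print(run_pipeline([[1, 2], [3, 4]]))
```

The serial baseline described in the abstract corresponds to calling `augment` and `train_step` back to back in one loop; the overlap above is what frees the host CPU/GPU from augmentation work.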

Keywords: training; bluefield dpus; distributed dnn; optimizing distributed; processing units

Journal Title: IEEE Micro
Year Published: 2022


