
Accelerating Allreduce With In-Network Reduction on Intel PIUMA


The Intel Programmable Integrated Unified Memory Architecture (PIUMA) system maps collective operations directly into the network switches and supports pipelined embeddings for high-throughput collective computation. Utilizing these features and PIUMA's network topology, we develop a methodology to generate extremely low-latency embeddings for in-network Allreduce. Our analysis shows that the proposed in-network Allreduce is highly scalable, with less than 1.5-μs single-element latency on 16K nodes. Compared to host-based Allreduce, it exhibits 36× less latency and 3.5× higher throughput. With deep neural network training as an example, we further demonstrate the benefits of our in-network Allreduce on end-user applications.
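The paper's in-network path performs the reduction inside PIUMA's switches, and that implementation is not publicly available. For orientation, the sketch below shows the kind of host-based Allreduce baseline the paper compares against, written with standard MPI: every rank contributes a value and every rank receives the global sum. This is an illustrative reference point only, not the authors' code; the file name, build commands, and variable names are our own.

/*
 * Illustrative host-based Allreduce baseline (not the PIUMA in-network path).
 * Build and run, assuming an MPI installation:
 *   mpicc allreduce_baseline.c -o allreduce_baseline
 *   mpirun -np 4 ./allreduce_baseline
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Single-element Allreduce: each rank contributes one value and all
       ranks receive the global sum. This is the latency-critical case the
       paper measures (under 1.5 μs on 16K nodes with in-network reduction). */
    double local = (double)rank;
    double global_sum = 0.0;
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over all ranks = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}

In a host-based collective like this one, every element crosses the NICs and is reduced in software on the endpoints; the paper's approach instead computes the reduction in the switch fabric as data flows through, which is the source of the reported 36× latency and 3.5× throughput advantage.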

Keywords: network; in-network Allreduce; PIUMA; Intel

Journal Title: IEEE Micro
Year Published: 2022


