The Intel Programmable Integrated Unified Memory Architecture (PIUMA) system maps collective operations directly onto the network switches and supports pipelined embeddings for high-throughput collective computation. Utilizing these features and PIUMA's network topology, we develop a methodology for generating extremely low-latency embeddings for in-network Allreduce. Our analysis shows that the proposed in-network Allreduce is highly scalable, with less than 1.5-μs single-element latency on 16K nodes. Compared to host-based Allreduce, it achieves 36× lower latency and 3.5× higher throughput. Using deep neural network training as an example, we further demonstrate the benefits of our in-network Allreduce for end-user applications.
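For intuition only, the sketch below models why an in-network Allreduce can scale so well: a single-element reduction performed inside the switches of a tree completes in a number of hops logarithmic in the node count. This is a minimal back-of-the-envelope model in Python, assuming a simple binary reduction tree and a hypothetical per-hop switch latency (`hop_latency_us`); it is not the paper's actual embedding methodology or its measured numbers.

```python
import math

def allreduce_latency_us(nodes: int, hop_latency_us: float = 0.05) -> float:
    """Estimate single-element Allreduce latency on a binary reduction tree.

    Partial sums ascend ceil(log2(nodes)) switch levels, and the result is
    broadcast back down, so the critical path is 2 * ceil(log2(nodes)) hops.
    hop_latency_us is a hypothetical per-hop latency, not a PIUMA figure.
    """
    depth = 2 * math.ceil(math.log2(nodes))
    return depth * hop_latency_us

for n in (16, 1024, 16384):
    print(f"{n:>6} nodes: ~{allreduce_latency_us(n):.2f} us")
```

Under this toy model, latency grows only with tree depth rather than with node count, which is consistent with the abstract's claim of microsecond-scale Allreduce even at 16K nodes.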