The Intel Programmable Integrated Unified Memory Architecture (PIUMA) system maps collective operations directly onto the network switches and supports pipelined embeddings for high-throughput collective computation. Utilizing these features and PIUMA's network topology, we develop a methodology for generating extremely low-latency embeddings for in-network Allreduce. Our analysis shows that the proposed in-network Allreduce is highly scalable, with less than 1.5-μs single-element latency on 16K nodes. Compared to host-based Allreduce, it achieves 36× lower latency and 3.5× higher throughput. Using deep neural network training as an example, we further demonstrate the benefits of our in-network Allreduce for end-user applications.
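For intuition only, the sketch below models why an in-network Allreduce can scale so well: a single-element reduction performed inside the switches of a tree completes in a number of hops logarithmic in the node count. This is a minimal back-of-the-envelope model in Python, assuming a simple binary reduction tree and a hypothetical per-hop switch latency (`hop_latency_us`); it is not the paper's actual embedding methodology or its measured numbers.

```python
import math

def allreduce_latency_us(nodes: int, hop_latency_us: float = 0.05) -> float:
    """Estimate single-element Allreduce latency on a binary reduction tree.

    Partial sums ascend ceil(log2(nodes)) switch levels, and the result is
    broadcast back down, so the critical path is 2 * ceil(log2(nodes)) hops.
    hop_latency_us is a hypothetical per-hop latency, not a PIUMA figure.
    """
    depth = 2 * math.ceil(math.log2(nodes))
    return depth * hop_latency_us

for n in (16, 1024, 16384):
    print(f"{n:>6} nodes: ~{allreduce_latency_us(n):.2f} us")
```

Under this toy model, latency grows only with tree depth rather than with node count, which is consistent with the abstract's claim of microsecond-scale Allreduce even at 16K nodes.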