"Distributed Encoding and Updating for SAZD Coded Distributed Training"

Linear combination (LC) based coded distributed computing (CDC) suffers from poor numerical stability. Consequently, LC-CDC based model parallel (MP) training of a deep neural network (DNN) may yield poor accuracy. To enhance accuracy, we propose to replace LC with shift-and-addition (SA) in the encoding process of each layer and to replace matrix inversion with zigzag decoding (ZD) in the decoding process; we call the resulting scheme Naive SAZD-CDC based MP training (N-SAZD-CDC-MP-T). However, N-SAZD-CDC-MP-T encounters a bottleneck at the master node, caused by frequent encoding and decoding at the master node and by frequent delivery of large volumes of data between the master and worker nodes. This bottleneck may reduce the training speed significantly. To alleviate it, we further design an enhanced version that offloads certain processing from the master node to the worker nodes through distributed encoding and updating (DEU), and call it DEU-SAZD-CDC-MP-T. We provide a proof that DEU-SAZD-CDC-MP-T automatically maintains the code structure during each iteration. Extensive numerical studies show that the prediction accuracy of SAZD-CDC-MP-T improves significantly over that of the Poly based scheme (a representative of LC). In addition, the training speed of DEU-SAZD-CDC-MP-T is significantly higher than that of N-SAZD-CDC-MP-T.
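
To make the shift-and-addition and zigzag decoding idea concrete, the following is a minimal sketch of the general technique on two short integer sequences. It is an illustration only and not the encoding used in the paper: the function names `sa_encode` and `zigzag_decode`, the use of exactly two source packets, and the one-position shift are all assumptions chosen for brevity.

```python
def sa_encode(a, b):
    """Build two shift-and-add packets from equal-length sequences a and b.

    Hypothetical helper for illustration; not the paper's encoder.
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("a and b must have the same length")
    # Packet 1: plain element-wise sum, no shift.
    c1 = [a[i] + b[i] for i in range(n)]
    # Packet 2: a plus b delayed by one position, so it has n + 1 entries.
    c2 = [
        (a[i] if i < n else 0) + (b[i - 1] if i >= 1 else 0)
        for i in range(n + 1)
    ]
    return c1, c2


def zigzag_decode(c1, c2):
    """Recover a and b from the two packets using only subtractions."""
    n = len(c1)
    a = [0] * n
    b = [0] * n
    for i in range(n):
        # c2[i] = a[i] + b[i-1]; b[i-1] is already known (or absent at i = 0).
        a[i] = c2[i] - (b[i - 1] if i >= 1 else 0)
        # c1[i] = a[i] + b[i]; a[i] was just recovered.
        b[i] = c1[i] - a[i]
    return a, b
```

The decoder alternates between the two packets: the first entry of the shifted packet exposes `a[0]` directly, which unlocks `b[0]` from the unshifted packet, which in turn unlocks `a[1]`, and so on. Only additions and subtractions are involved and no matrix is inverted, which is the property the abstract contrasts with LC based decoding.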

Keywords: cdc; sazd cdc; encoding updating; coded distributed; distributed encoding

Journal Title: IEEE Transactions on Parallel and Distributed Systems
Year Published: 2023
