This paper presents a comprehensive suite of techniques for optimized memory management in multi-GPU systems to accelerate deep learning application execution. We employ a hybrid utilization of GPU and CPU… Click to show full abstract
This paper presents a comprehensive suite of techniques for optimized memory management in multi-GPU systems to accelerate deep learning application execution. We employ a hybrid utilization of GPU and CPU memories in a multi-GPU environment by effectively addressing contention issues in the shared interconnect (e.g., PCIe, NVLink). In addition, we designed and implemented an intelligent prefetching algorithm (from CPU memory to GPU) that achieves the highest processing throughput while sustaining a large mini-batch size. We successfully implemented our optimization techniques on TensorFlow, and performed extensive experiments in various multi-GPU environments including traditional PCIe and the latest high-bandwidth interconnect, NVLink. Evaluation results show that our proposed scheme actually improves computing performance by decreasing the I/O bottleneck, and effectively increasing the mini-batch size without sacrificing overall training throughput.
               
Click one of the above tabs to view related content.