
Efficient and Reliable Deep Learning Training on Multi-GPU Platforms through Model Parallelism

Efficient and Robust Pipeline Design for Multi-GPU DNN Training through Model Parallelism

Advisor: 楊佳玲

Abstract


Training a deep neural network is computationally intensive and often takes days to weeks to complete. Parallel computation on multiple GPUs is therefore a common way to accelerate DNN training. Among the parallelization strategies, data parallelism is currently the mainstream approach because it is easy to implement; however, it often incurs heavy inter-GPU communication, which degrades performance. An alternative is model parallelism, in which each GPU is responsible for a portion of the neural network model. This approach greatly reduces inter-GPU communication, but it introduces load-balance and weight-staleness problems that must be addressed. In this thesis, we propose a novel model parallelism method that achieves load balance by executing the forward pass and the backward pass concurrently, and introduces a weight-prediction mechanism to mitigate the staleness problem. Experimental results show that our method achieves up to a 15.77x speedup over data parallelism and up to a 2.18x speedup over the state-of-the-art model parallelism algorithm, without affecting training accuracy.

Abstract (English)


The training process of a Deep Neural Network (DNN) is compute-intensive, often taking days to weeks to train a DNN model. Therefore, parallel execution of DNN training on GPUs is a widely adopted approach to speed up the process. Due to its implementation simplicity, data parallelism is currently the most commonly used parallelization method. Nonetheless, data parallelism suffers from excessive inter-GPU communication overhead caused by frequent weight synchronization among GPUs. Another approach is model parallelism, which partitions the model among GPUs. This approach can significantly reduce the inter-GPU communication cost compared to data parallelism; however, maintaining load balance is a challenge. Moreover, model parallelism faces the staleness issue; that is, gradients are computed with stale weights. In this thesis, we propose a novel model parallelism method that achieves load balance by concurrently executing the forward and backward passes of two batches, and resolves the staleness issue with weight prediction. The experimental results show that our proposal achieves up to a 15.77x speedup over data parallelism and up to a 2.18x speedup over the state-of-the-art model parallelism method, without incurring accuracy loss.
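To make the weight-prediction idea concrete, the sketch below shows one way such a mechanism could look when training with SGD plus momentum: a GPU whose gradients will only be applied several pipeline slots later extrapolates the current weights along the momentum-smoothed update direction, instead of computing with weights that will be stale by then. This is a minimal illustration inferred from the abstract, not the implementation described in the thesis; the function name predict_future_weights, the linear extrapolation formula, and the use of PyTorch are assumptions.

import torch

def predict_future_weights(params, momentum_bufs, lr, steps_ahead):
    """Extrapolate the current weights `steps_ahead` SGD-with-momentum
    updates into the future, approximating the weight version that will
    be in effect when this batch's gradients are finally applied."""
    predicted = []
    for p, v in zip(params, momentum_bufs):
        # With SGD + momentum, each update moves the weights by roughly
        # -lr * v (v = momentum-smoothed gradient), so `steps_ahead`
        # future updates move them by about -steps_ahead * lr * v.
        predicted.append(p - steps_ahead * lr * v)
    return predicted

# Toy usage: predict the weights three pipeline slots ahead.
weights = [torch.randn(4, 4)]
momentum = [torch.randn(4, 4)]   # momentum buffers taken from the optimizer
future = predict_future_weights(weights, momentum, lr=0.01, steps_ahead=3)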

