
Scheduling Optimization of Backpropagation for Deep Learning on GPU

Advisor: 劉邦鋒


Abstract


Many large deep neural network models have been proposed in recent years to achieve more accurate training results. Training these large models requires a huge amount of memory and communication, which has become a challenging issue in improving the performance of deep learning. In this paper, we analyze the data access pattern of training a deep neural network and propose a data pinning algorithm that reduces both the data usage on the GPU and the data movement between a GPU and its CPU host. We show that finding an optimal data movement schedule is NP-complete, and propose a dynamic programming algorithm that finds the optimal solution in pseudo-polynomial time. That is, we observe the access pattern of deep neural network training and propose a specialized GPU data pinning algorithm that minimizes unnecessary data movement. We then implement our dynamic programming algorithm to train real deep learning models. The experiments show that we can pin up to 20% more data into GPU memory than GeePS, a state-of-the-art deep learning framework.

We also propose a memory reduction technique for back-propagation in deep learning. We analyze the access pattern of back-propagation and observe that gradient computation and weight update, two major steps that are traditionally performed sequentially, can be partially overlapped. In addition, we analyze the semantics of the computation and observe that by delaying the weight update we can avoid the double buffering that read/write conflicts require in a traditional naive parallel implementation. We then implement our techniques and reduce the memory usage for parameter gradients by up to 75%.
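The abstract does not reproduce the thesis's exact dynamic program, but the pinning decision it describes — which data blocks to keep resident in a fixed GPU memory budget so as to avoid host transfers — has the shape of a 0/1 knapsack, which admits the same kind of pseudo-polynomial O(n × capacity) dynamic program. The sketch below is an illustrative stand-in under that assumption; `sizes`, `saved_costs`, and `capacity` are hypothetical inputs, not the thesis's formulation.

```python
# Hypothetical knapsack-style sketch of pseudo-polynomial data pinning.
# Each block i has a size (e.g. in MB) and a transfer cost saved if it
# stays pinned in GPU memory; we maximize total saved cost within the
# memory budget, which is what makes the DP pseudo-polynomial (its
# running time depends on the numeric value of the capacity).

def pin_schedule(sizes, saved_costs, capacity):
    n = len(sizes)
    # best[c] = max transfer cost saved using at most c units of memory
    best = [0] * (capacity + 1)
    choice = [[False] * (capacity + 1) for _ in range(n)]
    for i in range(n):
        # iterate capacity downward so each block is used at most once
        for c in range(capacity, sizes[i] - 1, -1):
            cand = best[c - sizes[i]] + saved_costs[i]
            if cand > best[c]:
                best[c] = cand
                choice[i][c] = True
    # backtrack to recover which blocks were pinned
    pinned, c = [], capacity
    for i in range(n - 1, -1, -1):
        if choice[i][c]:
            pinned.append(i)
            c -= sizes[i]
    return best[capacity], sorted(pinned)
```

For example, with blocks of sizes 4, 3, 2 saving costs 10, 7, 5 under a budget of 5, the DP pins the last two blocks for a total saving of 12 rather than pinning the single largest block.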
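The delayed-weight-update idea can be sketched for a stack of plain linear layers (names, shapes, and the absence of nonlinearities here are illustrative assumptions, not the thesis's implementation). A naive parallel backward pass double-buffers each weight matrix because propagating the error to the previous layer reads the weights while the update writes them; delaying the in-place update until after the error has been propagated removes the read/write conflict, so one buffer per layer suffices.

```python
import numpy as np

# Hypothetical sketch of backpropagation with delayed weight updates.
# For each layer l (from last to first):
#   1. compute the weight gradient,
#   2. propagate the error to layer l-1 using the OLD weights W[l],
#   3. only then update W[l] in place -- no second weight buffer needed.

def backward_delayed_update(W, acts, delta, lr):
    """W: list of weight matrices, updated in place.
    acts: per-layer inputs saved during the forward pass.
    delta: gradient of the loss w.r.t. the last layer's output."""
    for l in range(len(W) - 1, -1, -1):
        grad_W = acts[l].T @ delta      # gradient computation
        if l > 0:
            delta = delta @ W[l].T      # propagate error with old W[l]
        W[l] -= lr * grad_W             # delayed in-place weight update
    return delta
```

Because `grad_W` is consumed immediately by the update of the same layer, the parameter-gradient buffer never has to persist across layers, which is the kind of reuse the abstract's memory reduction relies on.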

