
深度網路訓練中反向傳播的GPU內存使用優化

GPU Memory Usage Optimization for Backward Propagation in Deep Network Training

Advisor: 劉邦鋒

Abstract


In modern deep learning, it has become a trend to design larger Deep Neural Networks (DNNs) to execute more complex tasks with better accuracy. At the same time, Convolutional Neural Networks (CNNs) have become the standard method for most computer vision tasks. However, the memory allocated for the intermediate data of these convolution layers can cause severe memory pressure during model training. Many solutions have been proposed to address this problem. Besides hardware-dependent solutions, a general methodology known as trading computation for memory, or rematerialization, reduces GPU memory usage at the cost of extra computation: it delays the computation of the activations of a subset of layers during the forward phase to save GPU memory and recomputes them in batches during the backward phase. In this thesis, we focus on efficiently finding the optimal set of checkpoints that achieves the lowest peak memory usage during model training. We first describe the theoretical background of neural network training and the mathematical equations it involves, and use these equations to identify all the data that must be kept during the forward and backward phases to compute the gradients of the model weights. We then formulate the checkpoint selection problem and propose a dynamic programming algorithm with time complexity O(n^3) that finds the optimal checkpoint subset. Through extensive experiments, we obtain a more accurate description of the problem from our theoretical analysis, revise the objective function based on our traces, and propose an O(n^2) dynamic programming algorithm for finding the optimal checkpoint subset.
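To make the checkpoint selection problem above concrete, the following is a minimal sketch that solves a deliberately simplified version of it on a linear chain of layers. It assumes peak memory is the sum of the stored checkpoint activations plus the size of the largest recomputed segment; this is an illustrative cost model, not the one derived in this thesis, and the budget-enumeration dynamic program below is a brute-force stand-in rather than the O(n^3) or O(n^2) algorithms proposed here. The function name min_peak_memory and the example layer sizes are hypothetical.

from typing import List

def min_peak_memory(m: List[float]) -> float:
    """Illustrative checkpoint selection on a linear chain of layers.

    m[i] is the activation size of layer i. Checkpointed activations are
    kept for the whole forward/backward pass; the layers between two
    consecutive checkpoints form a segment that is recomputed in one batch
    during the backward phase, so its activations coexist in memory.
    Simplified objective (an assumption, not the thesis's cost model):
        total = sum(checkpoint sizes) + max(segment size).
    """
    n = len(m)
    prefix = [0.0] * (n + 1)
    for i, size in enumerate(m):
        prefix[i + 1] = prefix[i] + size
    seg = lambda i, j: prefix[j] - prefix[i]           # total size of layers i..j-1

    # Every distinct contiguous-segment size is a candidate budget B for the
    # largest recomputed segment (0.0 means "checkpoint every layer").
    budgets = sorted({0.0} | {seg(i, j) for i in range(n) for j in range(i + 1, n + 1)})

    best = float("inf")
    for B in budgets:
        # dp[j]: minimal total checkpoint memory for layers 0..j-1, given that
        # layer j-1 is stored as a checkpoint and no segment exceeds B.
        # dp[0] is a virtual checkpoint holding the network input.
        INF = float("inf")
        dp = [INF] * (n + 1)
        dp[0] = 0.0
        for j in range(1, n + 1):
            for i in range(j):                          # previous checkpoint is layer i-1
                if dp[i] < INF and seg(i, j - 1) <= B:  # layers i..j-2 recomputed together
                    dp[j] = min(dp[j], dp[i] + m[j - 1])
        if dp[n] < INF:                                 # the last layer is always checkpointed here
            best = min(best, dp[n] + B)
    return best

if __name__ == "__main__":
    # Hypothetical per-layer activation sizes (e.g. in MB).
    print(min_peak_memory([4.0, 2.0, 8.0, 1.0, 6.0, 3.0]))

In practice, frameworks such as PyTorch already expose the rematerialization primitive itself (torch.utils.checkpoint), so a checkpoint plan like the one computed above only determines which segments to wrap; the selection policy is what this thesis optimizes.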

