Since the breakthrough of AlexNet in the 2012 ImageNet challenge, Deep Neural Networks (DNNs) have proven effective in a wide range of computing fields. Many hardware accelerator designs pair a small on-chip cache with a large off-chip memory to avoid the time and energy cost of frequent data movement. As process technology advances, however, hardware designers face an increasingly rich set of memory design options, so a tool that can weigh the trade-offs among these configurations has become important. Existing tools are limited in three ways: i) they support inference only, not training; ii) they use image-classification networks as the primary benchmark; and iii) they model only the dataflow inside convolutional layers, neglecting layers such as batch normalization and activation. We argue that training remains essential for extending DNNs to new application domains and for developing more efficient model structures, and that layers other than the convolutional and fully-connected layers play a non-negligible role in DNN training.

In this work, we propose a memory-centric analytical model for DNN training. The model takes the DNN architecture, the on-chip cache capacity, and the off-chip memory bandwidth as inputs, and assumes a near-optimal software-managed cache to sidestep the implementation details of any particular cache design. It then estimates the performance of a training iteration under those inputs, reporting metrics such as execution time, average bandwidth, and total memory traffic.

This work makes the following contributions: i) an analytical model that evaluates the performance of an entire DNN training iteration, accounting for all layers rather than only the most compute-intensive ones; ii) a thorough analysis of data reuse in DNNs at every scale; and iii) several observations and recommendations on where future DNN training research and optimization should focus.
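To make the model's interface concrete, the following is a minimal sketch of the kind of estimate described above, not the thesis's actual model. It assumes a simple roofline-style rule — under a near-optimal software-managed cache, each layer is bound by either compute or off-chip bandwidth, whichever is slower — and all names (`Layer`, `estimate_iteration`) and the per-layer FLOP/traffic numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Hypothetical per-layer summary for one training iteration."""
    name: str
    flops: float          # compute work (forward + backward), in FLOPs
    offchip_bytes: float  # data that must cross the off-chip memory bus

def estimate_iteration(layers, peak_flops, offchip_bw):
    """Roofline-style estimate of one training iteration.

    Each layer contributes max(compute time, off-chip memory time);
    returns (execution time [s], total traffic [bytes], avg bandwidth [B/s]).
    """
    total_time = 0.0
    total_traffic = 0.0
    for layer in layers:
        compute_time = layer.flops / peak_flops
        memory_time = layer.offchip_bytes / offchip_bw
        total_time += max(compute_time, memory_time)
        total_traffic += layer.offchip_bytes
    avg_bw = total_traffic / total_time if total_time > 0 else 0.0
    return total_time, total_traffic, avg_bw

# Example: a compute-heavy conv layer and a memory-bound batch-norm layer,
# on a device with 1 TFLOP/s peak compute and 100 GB/s off-chip bandwidth.
layers = [
    Layer("conv1", flops=2e9, offchip_bytes=50e6),
    Layer("bn1",   flops=1e7, offchip_bytes=40e6),
]
t, traffic, bw = estimate_iteration(layers, peak_flops=1e12, offchip_bw=100e9)
```

In this toy example the conv layer is compute-bound (2 ms of compute vs. 0.5 ms of memory time) while the batch-norm layer is memory-bound (0.4 ms of memory time vs. 0.01 ms of compute), which illustrates why layers beyond convolution and fully-connected cannot be neglected in a memory-focused analysis.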