In recent years, convolutional neural networks (CNNs) have relied on highly complex computation to maintain high accuracy, but convolution operations involve an enormous amount of computation, which causes massive data movement in memory and consumes a great deal of energy. To improve overall computational efficiency and reduce the energy cost of data movement, the order in which data is accessed is critical: within limited hardware resources, we must reduce data transfer and increase data reuse. Therefore, this thesis optimizes the dataflow in order to accelerate CNNs. We observe that the sizes of the input feature maps (ifmaps) and filters differ greatly across the layers of a CNN, so running the entire network with a single accelerator configuration cannot utilize hardware resources effectively. We therefore propose to dynamically configure, for each CNN layer and according to these data size differences, the PE (Processing Element) array, the on-chip buffer (SRAM) size, and the dataflow optimized for the chosen PE array, so as to raise overall PE utilization and reduce accesses between memories. This thesis uses SCALE-Sim (Systolic CNN AcceLErator Simulator) [1], proposed by ARM, to conduct simulation experiments. SCALE-Sim is a CNN accelerator simulator based on the systolic array architecture, but it can only evaluate a CNN on a single, fixed hardware configuration; in contrast, our proposed scheme can dynamically configure an optimized hardware architecture, giving designers room to trade off. Experimental results show that, compared with the original SCALE-Sim, our dynamic configuration method not only alleviates the heavy ifmap accesses of the first few layers in HarDNet39 [2] and DenseNet121 [3], reducing overall DRAM accesses by 6% to 35%, but also improves PE utilization by about 10% to 12% on HarDNet39 [2], shortening the run time.
The computation of a convolutional neural network (CNN) requires a significant number of memory accesses, which leads to high energy consumption. To reduce the energy spent on data movement, a dataflow that maximizes data reuse and minimizes data migration between the on-chip buffer and external DRAM is essential. Therefore, we propose a dataflow optimization technique to accelerate CNNs. We find that the data sizes of the input feature maps and filters differ greatly from layer to layer, so hardware resources cannot be utilized effectively if the hardware architecture is not reconfigurable. In this thesis, we propose to dynamically configure the PE array, on-chip buffer, and dataflow for each layer of the CNN, in order to maximize PE utilization and minimize data migration. We use SCALE-Sim (Systolic CNN AcceLErator Simulator) [1], proposed by ARM, to conduct simulation experiments. SCALE-Sim is a CNN accelerator simulator based on the systolic array architecture. Unlike SCALE-Sim, which can only evaluate a single specified accelerator configuration for a given neural network, our proposed scheme dynamically configures an optimized hardware architecture and dataflow. Experimental results show that our method not only reduces external memory (DRAM) accesses by 6%-35% on HarDNet39 [2] and DenseNet121 [3], but also improves PE utilization by 10%-12% on HarDNet39 [2].
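To make the idea of per-layer configuration concrete, the following Python sketch illustrates how a PE-array shape could be chosen per layer. It is not the implementation used in this thesis: the candidate array shapes, the example layer dimensions, and the simplified utilization model (a weight-stationary mapping in which filter elements map to rows and output channels map to columns, with oversized layers folded onto the array) are assumptions made purely for illustration.

```python
# Illustrative sketch only: pick, for each layer, the PE-array shape with the
# highest estimated utilization under a simplified weight-stationary mapping.
from math import ceil
from typing import List, Tuple


def ws_utilization(rows_needed: int, cols_needed: int, pe_h: int, pe_w: int) -> float:
    """Estimate PE utilization for a weight-stationary mapping:
    filter elements map to array rows, output channels map to array columns,
    and a layer larger than the array is folded across multiple passes."""
    folds = ceil(rows_needed / pe_h) * ceil(cols_needed / pe_w)
    return (rows_needed * cols_needed) / (folds * pe_h * pe_w)


def pick_config(layers: List[dict],
                candidates: List[Tuple[int, int]]) -> List[Tuple[int, int, float]]:
    """For each layer, choose the candidate PE-array shape (height, width)
    with the best estimated utilization; all candidates use the same PE count."""
    best_per_layer = []
    for layer in layers:
        rows = layer["R"] * layer["S"] * layer["C"]   # filter height * width * input channels
        cols = layer["M"]                             # number of filters (output channels)
        best = max(candidates, key=lambda hw: ws_utilization(rows, cols, *hw))
        best_per_layer.append((*best, ws_utilization(rows, cols, *best)))
    return best_per_layer


if __name__ == "__main__":
    # Hypothetical layer shapes: an early layer with few channels and a late
    # layer with many, so the preferred array shape can differ between them.
    layers = [
        {"R": 3, "S": 3, "C": 3,   "M": 64},    # early layer, small filters
        {"R": 3, "S": 3, "C": 100, "M": 300},   # late layer, large filters
    ]
    candidates = [(8, 128), (32, 32), (128, 8)]  # same PE count, different shapes
    for layer, cfg in zip(layers, pick_config(layers, candidates)):
        print(layer, "-> PE array", cfg[:2], "estimated utilization %.2f" % cfg[2])
```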