
Reconfigurable Convolutional Neural Network Accelerator Design

A Reconfigurable CNN Accelerator Design

Advisor: 楊佳玲

Abstract


As the number of parameters in convolutional neural networks (CNNs) grows, the performance and energy efficiency of CNN accelerators become an important problem. From previous designs, we can observe that, due to the huge data volume, DRAM accesses account for a large portion of energy consumption. Observing the computation behavior of convolutional layers, we find that many parameters can be shared across computations, but limited on-chip storage may force these parameters to be fetched from DRAM repeatedly. We therefore want to reuse parameters through the accelerator's on-chip storage to reduce DRAM reads during computation. Parameter reuse falls into three kinds: reusing the input parameters, reusing the filters, and reusing the intermediate results. Each layer of a CNN model may favor a different data reuse policy depending on the size of its input, output, and filters, but existing CNN accelerators focus on only one kind of data reuse throughout CNN processing. To gain the flexibility of using a different data reuse policy for each layer, we propose a reconfigurable CNN accelerator design that can be flexibly configured to exploit different types of data reuse and thereby minimize the amount of data accessed from DRAM. By decomposing CNN processing into computation primitives, which are units of convolution over different inputs and filters, we can exploit different kinds of data reuse by arranging the execution order of these primitives in the accelerator. The accelerator executes instructions generated from the policies produced by an off-line analysis module. Our results show that with this reconfigurable design the amount of DRAM accesses can be reduced; we compare the difference in execution time under different data reuse policies, and we also analyze the execution results under different hardware constraints.
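To make the reuse idea concrete, the following minimal Python sketch is our own illustration, not the thesis's implementation; the function names conv_filter_reuse and conv_input_reuse are hypothetical. It shows how the loop ordering of a direct convolution determines which operand an accelerator could keep resident on-chip:

```python
import numpy as np

def conv_filter_reuse(ifmap, weights):
    """Filter-reuse ordering: the filter loop is outermost, so each
    filter could stay resident on-chip while it sweeps the whole input."""
    C, H, W = ifmap.shape                  # input channels, height, width
    M, _, K, _ = weights.shape             # M filters of shape (C, K, K)
    out = np.zeros((M, H - K + 1, W - K + 1))
    for m in range(M):                     # one filter at a time
        for y in range(H - K + 1):
            for x in range(W - K + 1):
                out[m, y, x] = np.sum(ifmap[:, y:y+K, x:x+K] * weights[m])
    return out

def conv_input_reuse(ifmap, weights):
    """Input-reuse ordering: the spatial loops are outermost, so each
    input window is fetched once and shared by all M filters."""
    C, H, W = ifmap.shape
    M, _, K, _ = weights.shape
    out = np.zeros((M, H - K + 1, W - K + 1))
    for y in range(H - K + 1):
        for x in range(W - K + 1):
            window = ifmap[:, y:y+K, x:x+K]   # reused across all filters
            for m in range(M):
                out[m, y, x] = np.sum(window * weights[m])
    return out
```

Both orderings compute the same output; in hardware, the outermost loop decides which data is loaded once and which must be re-fetched from DRAM when the on-chip buffer cannot hold everything.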

Parallel Abstract


As convolutional neural networks (CNNs) grow in size, the performance and energy efficiency of CNN accelerators become an important problem. Previous works show that DRAM accesses account for a large part of energy consumption. To reduce DRAM accesses, we observe the computation behavior of convolutional layers: many parameters are shared between computations, yet this data may be loaded on-chip repeatedly because of the limited on-chip buffer size of an accelerator. We would like to capture data reuse via the on-chip buffer to reduce the DRAM accesses of CNN computation. Three kinds of data reuse can be captured, where the reused data is kept in the on-chip buffer and evicted when no longer needed: input feature map reuse, filter reuse, and intermediate feature map reuse. Each layer in a CNN model may favor a different data reuse policy based on the size of its input, output, and filters, but existing CNN accelerators focus on only one type of data reuse throughout CNN processing. To provide the flexibility of using a different data reuse policy for each layer, we propose a reconfigurable CNN accelerator design that can be configured to capture different types of reuse with the objective of minimizing off-chip memory accesses. By separating CNN processing into several computation primitives, which are units of convolution with different inputs and filters, we can reuse different data by arranging the computation ordering of those primitives in our accelerator. The accelerator executes instructions generated by an off-line generator that considers the optimal reuse policy and the hardware constraints. Our work shows that DRAM accesses can be reduced with our reconfigurable design; we compare the execution time and energy when using different data reuse policies, and we also analyze the effect of different configurations in our CNN accelerator design.
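To picture how a per-layer policy might be chosen, here is a rough sketch in the spirit of the off-line generator. The traffic formulas and the functions dram_traffic and best_policy are our own illustrative assumptions (stride 1, no padding, a single shared buffer), not the thesis's actual model:

```python
from math import ceil

def dram_traffic(layer, policy, buf_words):
    """First-order DRAM traffic estimate (in words) for one conv layer
    under one reuse policy. Illustrative only."""
    C, H, W, M, K = layer              # in-channels, height, width, filters, kernel
    ho, wo = H - K + 1, W - K + 1      # output spatial size (stride 1, no padding)
    ifmap = C * H * W                  # input feature-map words
    filt  = M * C * K * K              # filter words
    ofmap = M * ho * wo                # output feature-map words
    if policy == "input":              # input tiles stay resident;
        tiles = ceil(ifmap / buf_words)    # filters re-streamed per tile
        return ifmap + filt * tiles + ofmap
    if policy == "filter":             # filter groups stay resident;
        groups = ceil(filt / buf_words)    # input re-streamed per group
        return ifmap * groups + filt + ofmap
    if policy == "intermediate":       # partial sums stay resident; overflow
        spill = max(0, ofmap - buf_words)  # is written back and read again
        return ifmap + filt + ofmap + 2 * spill
    raise ValueError(policy)

def best_policy(layer, buf_words):
    """Pick, per layer, the reuse policy that minimizes estimated traffic."""
    return min(("input", "filter", "intermediate"),
               key=lambda p: dram_traffic(layer, p, buf_words))
```

Calling best_policy once per layer with that layer's dimensions and the buffer size yields a per-layer schedule, which an instruction generator could then lower into the accelerator's configuration, mirroring the flow described above.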

