
基於循環矩陣之神經網路的優化吞吐量軟硬體協同設計

The Throughput-Improved Software-Hardware Codesign for Neural Networks Based on Circulant Matrix

Advisor: 湯松年

Abstract


Convolutional neural networks (CNNs) have demonstrated their potential in recent years. As CNNs become deeper, the number of weights grows substantially, making it increasingly difficult to port a trained CNN model to an embedded system; research on reducing the number of weights has therefore emerged. This work adopts a software-hardware co-design approach. Besides implementing in hardware the replacement of the fully connected layer's original weights with circulant matrices to lower the storage burden, it also proposes excluding from the computation any value that equals 0 after the Rectified Linear Unit (ReLU), thereby reducing the overall amount of computation. In addition, instead of feeding a new sample only after the previous sample's inference has completed, the software-side program combines different sets of input data so that they are processed simultaneously, allowing the weights to be used effectively; Direct Memory Access (DMA) is used for data transfer to reduce transfer time. Experimental results show that the proposed architecture saves roughly 40% of the time compared with the conventional approach, and its GOPS value is 3.2.
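To make the circulant-matrix idea concrete, the following is a minimal NumPy sketch, not the thesis's hardware implementation: a fully connected layer whose k-by-k weight matrix is stored as only its first column, so k values are kept instead of k*k. The function names, the FFT-based multiplication, and the layer size are illustrative assumptions.

import numpy as np

# A k-by-k circulant matrix is fully determined by its first column,
# so only k weights are stored instead of k*k.
def circulant_matvec(first_col, x):
    # y = C @ x with C[i, j] = first_col[(i - j) % k], i.e. a circular
    # convolution, computed here via the FFT.
    return np.real(np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x)))

def circulant_fc_relu(first_col, x, bias):
    # Fully connected layer with a circulant weight matrix, followed by ReLU.
    return np.maximum(circulant_matvec(first_col, x) + bias, 0.0)

# Example: an 8-dimensional layer stores 8 weights instead of 64.
rng = np.random.default_rng(0)
k = 8
stored_weights = rng.standard_normal(k)   # first column of the circulant matrix
x = rng.standard_normal(k)                # input activations
out = circulant_fc_relu(stored_weights, x, np.zeros(k))

Storing one column per circulant matrix is what shrinks the fully connected layer's memory footprint; in hardware the same product could also be formed from cyclic shifts of the stored column rather than FFTs.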

Abstract (English)


In recent years, the convolutional neural network (CNN) has shown its potential. As CNNs become deeper, the number of weights increases significantly, which makes it increasingly difficult to port a pre-trained CNN model to embedded systems; research on reducing the number of weights has therefore begun to appear. We adopt a software-hardware co-design approach. In addition to reducing the storage burden by replacing the original weights of the fully connected layer with circulant matrices in hardware, we propose that any value which is 0 after the Rectified Linear Unit (ReLU) be excluded from the subsequent computation, reducing the overall amount of calculation. Compared with the usual flow of sending the next sample only after inference on the current sample has finished, we use the software-side program to combine different sets of input data so that they are processed at the same time, which also allows the weights to be used effectively. We also use Direct Memory Access (DMA) to transfer data and reduce the transfer time. Experimental results show that the proposed architecture saves about 40% of the time compared with the general method, and its GOPS value is 3.2.
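The zero-skipping and sample-combining ideas can likewise be illustrated in software. The sketch below is a NumPy illustration under assumed shapes and function names; the thesis realizes these steps in hardware with DMA-based transfers, which the sketch does not model.

import numpy as np

def dense_skip_zero_inputs(relu_out, weights, bias):
    # Next-layer computation that skips inputs equal to 0 after ReLU:
    # their weight columns are never fetched or multiplied.
    active = np.nonzero(relu_out)[0]
    return weights[:, active] @ relu_out[active] + bias

def combined_inference(samples, weights, bias):
    # Process several input vectors together so each weight value that is
    # read is reused across the whole group, instead of running one sample
    # to completion before starting the next.
    x = np.stack(samples, axis=1)              # shape: (n_in, n_samples)
    return np.maximum(weights @ x + bias[:, None], 0.0)

# Example: two samples share every weight read.
W = np.random.default_rng(1).standard_normal((4, 6))
y = combined_inference([np.ones(6), np.zeros(6)], W, np.zeros(4))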

