
Convolutional Neural Network Accelerator Based on Winograd Algorithm Architecture

Advisor: 鄭維凱

Abstract


The development of modern artificial intelligence has led to the widespread use of deep learning, and Convolutional Neural Networks (CNNs) offer high accuracy in image recognition. However, as network architectures have grown more complex in recent years, both the amount of computation and the computation time have increased, and research on accelerating CNN computation has emerged so that inference can reach real-time performance. The vast majority of the computation in a CNN lies in the convolution operation, so this thesis proposes a hardware architecture based on the Winograd algorithm [1]. The goals are to use the Winograd algorithm to reduce the number of multiplications required by convolution, lowering the computational complexity of the convolutional layers, and to adopt a hardware implementation that differs from previous related work, optimizing the time Winograd convolution spends on matrix transformations. Besides stride-1 Winograd convolution, the proposed pipelined architecture also applies to stride-2 Winograd convolution decomposed by the method of [2]. The experiments use HarDNet39 [3] and HarDNet68 [3] to compare the efficiency of running Winograd convolution on the proposed architecture against running sliding-window convolution on a systolic array architecture [4]. The results show that, with the same number of multiplier-accumulator (MAC) units, the proposed architecture reduces convolution computation cycles by roughly 40%.
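To make the multiplication savings concrete, below is a minimal NumPy sketch of the 1-D Winograd transform F(2, 3) from Lavin and Gray [1], which the thesis builds on: it computes two outputs of a 3-tap filter with 4 element-wise multiplications instead of the 6 needed by direct sliding-window evaluation (the practical 2-D case F(2×2, 3×3) nests the same transforms, cutting 36 multiplications to 16). This illustrates the algorithm only, not the thesis's hardware design; the function name winograd_f23 and the test values are illustrative assumptions.

```python
import numpy as np

# Transform matrices for 1-D Winograd F(2, 3) [1]: two outputs of a
# 3-tap filter from a 4-sample input tile, using 4 multiplications
# instead of the 6 required by direct sliding-window computation.
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1.0,  0.0, 0.0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a valid 1-D convolution (correlation, as in CNNs)
    of a 4-sample tile d with a 3-tap filter g via Winograd F(2, 3)."""
    U = G @ g    # filter transform (depends only on weights)
    V = BT @ d   # input transform (additions/subtractions only)
    M = U * V    # 4 element-wise multiplications -- the MAC work
    return AT @ M  # output transform back to the spatial domain

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
ref = np.array([np.dot(d[i:i+3], g) for i in range(2)])  # sliding window
assert np.allclose(winograd_f23(d, g), ref)
```

Note that the filter transform U = Gg depends only on the weights and can be precomputed offline, while the input and output transforms consist solely of additions and subtractions; this is what makes the transform stages cheap relative to the multiplications they save.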

Parallel Abstract


Due to the development of modern artificial intelligence, deep learning technology is widely used. Convolutional Neural Networks (CNNs) achieve high accuracy in image recognition, but the growing complexity of network architectures in recent years has increased both the amount of computation and the computation time. To make the inference process achieve real-time performance, research on accelerating CNN computation has emerged. A large part of the computational demand of a CNN lies in the convolution operation, so this thesis proposes a hardware architecture design based on the Winograd algorithm [1] to reduce the computational complexity of the convolutional layer, using an implementation approach different from previous Winograd designs to optimize the time spent on the Winograd matrix transformations. Beyond stride-1 Winograd convolution, the proposed pipeline architecture can also be applied to stride-2 Winograd convolution decomposed by the methodology of [2]. The experiments use HarDNet39 [3] and HarDNet68 [3] to compare the efficiency of executing Winograd convolution on the proposed architecture against executing sliding-window convolution on a systolic array architecture [4]. The results show that the proposed architecture reduces convolution computation cycles by roughly 40% under the same number of multiplier-accumulator (MAC) resources.
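The stride-2 support mentioned above rests on the decomposition idea of [2]: splitting the input and the filter into even- and odd-indexed phases rewrites one stride-2 convolution as a sum of stride-1 sub-convolutions, each of which fits a stride-1 Winograd pipeline. Below is a minimal 1-D NumPy sketch under that reading; the function names and test data are illustrative assumptions, and the thesis applies the 2-D analogue in hardware.

```python
import numpy as np

def conv1d_stride2(d, g):
    """Direct stride-2 valid convolution (correlation) with a 3-tap filter."""
    n_out = (len(d) - 3) // 2 + 1
    return np.array([np.dot(d[2*i:2*i+3], g) for i in range(n_out)])

def conv1d_stride2_decomposed(d, g):
    """Stride-2 convolution rewritten as stride-1 sub-convolutions,
    following the even/odd phase decomposition idea of [2]."""
    n_out = (len(d) - 3) // 2 + 1
    e, o = d[0::2], d[1::2]    # even / odd input phases
    ge, go = g[0::2], g[1::2]  # even taps [g0, g2], odd tap [g1]
    # Each branch is a stride-1 convolution, so each can run on a
    # stride-1 Winograd pipeline; their sum is the stride-2 result.
    y_even = np.array([np.dot(e[i:i+2], ge) for i in range(n_out)])
    y_odd = go[0] * o[:n_out]
    return y_even + y_odd

d = np.arange(9, dtype=float)
g = np.array([1.0, -2.0, 3.0])
assert np.allclose(conv1d_stride2(d, g), conv1d_stride2_decomposed(d, g))
```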

Parallel Keywords

Winograd CNN Accelerator

References


[1] A. Lavin and S. Gray, "Fast Algorithms for Convolutional Neural Networks," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4013-4021, Jun. 2016.
[2] J. Yepez and S.-B. Ko, "Stride 2 1-D, 2-D, and 3-D Winograd for Convolutional Neural Networks," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 4, pp. 853-863, Apr. 2020.
[3] P. Chao, C.-Y. Kao, Y.-S. Ruan, C.-H. Huang, and Y.-L. Lin, "HarDNet: A Low Memory Traffic Network," in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3552-3561, Oct. 2019.
[4] H. T. Kung, "Why systolic architectures?," IEEE Computer, vol. 15, no. 1, pp. 37-46, Jan. 1982.
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
