Convolutional neural networks (CNNs) have in recent years been widely applied to image recognition, natural language processing, and other fields, and have driven the revival of artificial intelligence. Their high computational complexity, combined with the real-time requirements of running on end devices, has given rise to many topics and studies on CNN acceleration. Since the convolution layers account for more than 90% of a CNN's total computation, this thesis focuses on hardware acceleration of the convolution layer to speed up CNN inference. We adopt the overlap-add FFT method in place of the sliding-window computation in convolution; for modern deep convolutional neural network (DCNN) models, this method effectively reduces computational complexity by 40% to 50%. We implement the method on a field-programmable gate array (FPGA) to evaluate how overlap-add FFT accelerates CNNs on edge devices. For this algorithm we design a multi-cluster processing architecture on the FPGA that computes four feature maps simultaneously to raise overall throughput, while the split and overlap-add steps and other layers, such as activation functions, are handled on the CPU. We apply this scheme to ResNet-34 and perform image recognition on the CIFAR-10 dataset on a Xilinx ZCU104 development board, aiming to improve the speed of image recognition.
Convolutional neural networks (CNNs) have been widely applied to image recognition and natural language processing, and have led the resurgence of artificial intelligence. Because of the high computational complexity of CNNs and the real-time requirements of embedded systems, a growing body of research targets CNN acceleration. Moreover, since more than 90% of a CNN's computation occurs in the convolution (CONV) layers, we focus on accelerating the CONV layer in hardware to speed up CNN inference. In this thesis, we adopt the overlap-add FFT method in place of traditional sliding-window convolution, reducing the number of operations in state-of-the-art CNN architectures by 40% to 50%. We implement overlap-add FFT on an FPGA and apply pipelining and parallel processing in the target architecture to accelerate the computation. We design a multi-cluster architecture on the FPGA for this algorithm; it processes four feature maps at a time to increase throughput. The split and overlap steps, as well as other layer operations such as activation functions, run on the CPU. We set up our experiment on a Xilinx ZCU104 board and run ResNet-34 inference on the CIFAR-10 dataset, hoping to accelerate the speed of visual recognition.
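To make the abstract's core idea concrete, the following is a minimal NumPy sketch of overlap-add FFT convolution, the technique the thesis substitutes for sliding-window convolution. It is an illustration only, not the thesis's FPGA implementation: the tile size `block` and the function names are illustrative choices, and a real accelerator would reuse the kernel FFT across feature maps and tiles exactly as shown here to save multiplications.

```python
import numpy as np

def overlap_add_conv2d(image, kernel, block=8):
    """Full 2D convolution via overlap-add FFT: split the image into
    block x block tiles, convolve each zero-padded tile in the
    frequency domain, and add the overlapping tails into the output."""
    H, W = image.shape
    kh, kw = kernel.shape
    fft_h, fft_w = block + kh - 1, block + kw - 1   # per-tile FFT size
    out = np.zeros((H + kh - 1, W + kw - 1))
    K = np.fft.rfft2(kernel, (fft_h, fft_w))        # kernel FFT, computed once
    for i in range(0, H, block):
        for j in range(0, W, block):
            tile = image[i:i + block, j:j + block]  # edge tiles may be smaller
            T = np.fft.rfft2(tile, (fft_h, fft_w))
            piece = np.fft.irfft2(T * K, (fft_h, fft_w))
            th, tw = tile.shape
            # overlapping tails of neighbouring tiles accumulate here
            out[i:i + th + kh - 1, j:j + tw + kw - 1] += piece[:th + kh - 1, :tw + kw - 1]
    return out

def direct_full_conv2d(image, kernel):
    """Reference sliding-window (direct) full convolution, for comparison."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(H):
        for j in range(W):
            out[i:i + kh, j:j + kw] += image[i, j] * kernel
    return out
```

Both functions produce the same result; the savings come from replacing per-pixel multiply-accumulates with per-tile FFTs whose cost is amortized, which is where the 40% to 50% operation reduction cited above originates for large feature maps.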