
使用區塊循環矩陣與傅立葉轉換運算之遞歸神經網路的推論引擎設計

Inference Engine Design for Recurrent Neural Networks Using Block-Circulant Matrices and Fourier Transform Operations

Advisor: 湯松年

Abstract


This research focuses on accelerating the gate operations in the Gated Recurrent Unit (GRU). We propose two architectures: a baseline hardware architecture for GRU gate operations and a gate-optimized hardware architecture for GRU gate operations. In both, the matrix multiplications of the gate operations are computed by combining the concept of block-circulant matrices, approximate activation functions, and the Fast Fourier Transform (FFT) algorithm; the two architectures are distinguished by how the operating time and number of invocations of the FFT and IFFT units are optimized, and the inference engine is realized through hardware-software co-design. Compared with the baseline architecture, the optimized architecture reduces the number of FFT and IFFT unit invocations by up to a factor of N, where N = input data length / FFT (IFFT) length. As the operating time and invocation count of the FFT and IFFT units decrease, the power consumption of the inference engine improves noticeably: relative to the baseline inference engine, the optimized inference engine reduces the power of the FFT unit by about 60% and that of the IFFT unit by about 35%. In the software design, we use fixed-point simulation to compress the bit width of the data, reducing the hardware computation burden, and we set the actual specifications of the hardware architecture by measuring the signal-to-quantization-noise ratio (SQNR). In the hardware design flow, after writing the design in a hardware description language (HDL), we test and synthesize the architecture with the EDA tool Vivado, run post-simulation (Post-Sim) at the actual operating frequency to better approximate the behavior on the development board, and measure and analyze the power consumption. Finally, we compare and analyze the proposed gate-optimized hardware architecture for GRU gate operations against implementation results from related works, including hardware resource utilization, power consumption, frames per second (FPS), and energy efficiency; we also compare architectural characteristics to highlight the scalability of our design. For the two proposed architectures, we additionally compare the power consumption caused by the FFT and IFFT units, respectively, and analyze whether the intended power reduction is achieved.
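The central computation in both architectures is the block-circulant matrix-vector product evaluated with FFTs, so each k x k circulant block contributes only a length-k element-wise product in the frequency domain instead of a dense multiplication. The sketch below illustrates that idea in floating-point NumPy; the block size, block counts, and the sigmoid gate are assumptions for illustration, and the fixed-point arithmetic, approximate activations, and dedicated FFT/IFFT hardware units of the actual engine are not modeled.

```python
# Minimal NumPy sketch (floating point) of block-circulant gate matrices
# computed through FFT-based circular convolution. Sizes and the sigmoid
# gate are illustrative assumptions, not the thesis's implementation.
import numpy as np

def circulant_matvec_fft(c, x):
    """C @ x for the circulant matrix whose first column is c,
    via circular convolution: IFFT(FFT(c) * FFT(x))."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def block_circulant_matvec(first_cols, x, k):
    """Block-circulant matrix-vector product.
    first_cols[i][j] is the length-k vector defining the k x k block (i, j);
    x has length n*k and the result has length m*k."""
    m, n = len(first_cols), len(first_cols[0])
    x_blocks = x.reshape(n, k)
    y = np.zeros((m, k))
    for i in range(m):
        for j in range(n):
            y[i] += circulant_matvec_fft(first_cols[i][j], x_blocks[j])
    return y.reshape(m * k)

# One GRU-style gate, z = sigmoid(W x + U h + b), with W and U stored in
# block-circulant form (one length-k vector per k x k block).
k, m, n = 4, 2, 3
rng = np.random.default_rng(0)
W = [[rng.standard_normal(k) for _ in range(n)] for _ in range(m)]
U = [[rng.standard_normal(k) for _ in range(m)] for _ in range(m)]
b = rng.standard_normal(m * k)
x, h = rng.standard_normal(n * k), rng.standard_normal(m * k)
z = 1.0 / (1.0 + np.exp(-(block_circulant_matvec(W, x, k)
                          + block_circulant_matvec(U, h, k) + b)))
print(z.shape)  # (8,)
```

Storing one length-k vector per k x k block is what compresses the gate weights, and the per-block FFT length equals the block size k; in this notation the reduction factor N from the abstract (input data length divided by FFT length) corresponds to the number of input blocks n.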

Abstract (English)


In our research, we focus on accelerating the gate operations in Gated Recurrent Unit (GRU) networks. We propose two architectures: a baseline hardware architecture for GRU gate operations and a gate-optimized hardware architecture for GRU gate operations. Our inference engine integrates the matrix multiplication in the gate operations with the concepts of block-circulant matrices, approximate activation functions, and the Fast Fourier Transform (FFT) algorithm. The two architectures are distinguished by optimizing the operating time and invocation count of the FFT and IFFT units. Through hardware-software co-design, we complete the computations of our inference engine. Ultimately, the optimized architecture can reduce the number of FFT and IFFT unit invocations by up to a factor of N compared to the baseline architecture, where N is defined as the input data length divided by the FFT (IFFT) length. As the operating time and invocation count of the FFT and IFFT units decrease, there are significant benefits in the power consumption of the inference engine. In our research, the optimized inference engine reduces power consumption by approximately 60% for the FFT computation units and by approximately 35% for the IFFT computation units compared to the baseline inference engine. In the software design, we use fixed-point simulation to compress the bit width of the data, thus reducing the computational burden on the hardware. We establish the specifications for our hardware architecture by measuring the Signal-to-Quantization-Noise Ratio (SQNR). In the hardware design process, we use a Hardware Description Language (HDL) and test and synthesize our hardware architecture with the EDA tool Vivado. We conduct post-simulation (Post-Sim) at the actual operating frequency to closely approximate the behavior on a development board, followed by power consumption measurement and analysis. Finally, we compare and analyze the proposed gate-optimized hardware architecture for GRU gate operations against implementation data from other related works, including hardware resource usage, power consumption, Frames Per Second (FPS), and energy efficiency. We also compare and analyze the characteristics of the hardware architectures to highlight the extensibility of our design. For both proposed architectures, we separately compare the power consumption caused by the FFT and IFFT units and analyze whether the intended reduction in power consumption is achieved.
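The fixed-point exploration in the software design can be summarized as: quantize the signals to a candidate bit width, measure the SQNR against the floating-point reference, and use that measurement to fix the hardware specification. Below is a minimal sketch of that measurement, assuming an illustrative signed fixed-point format and synthetic data rather than the formats and signals actually evaluated in the thesis.

```python
# Minimal sketch of the fixed-point/SQNR measurement described above.
# The signed fixed-point format (int_bits, frac_bits) and the synthetic
# signal are illustrative assumptions, not the thesis's chosen specification.
import numpy as np

def quantize(x, int_bits, frac_bits):
    """Round x to a signed fixed-point grid with 2**-frac_bits resolution,
    saturating at the range given by int_bits integer bits (sign included)."""
    scale = 2.0 ** frac_bits
    lo = -(2.0 ** (int_bits - 1))
    hi = 2.0 ** (int_bits - 1) - 1.0 / scale
    return np.clip(np.round(x * scale) / scale, lo, hi)

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    noise = reference - quantized
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

rng = np.random.default_rng(0)
signal = rng.standard_normal(10_000)          # stand-in for gate activations
for frac_bits in (4, 8, 12):
    q = quantize(signal, int_bits=4, frac_bits=frac_bits)
    print(f"frac_bits={frac_bits}: SQNR = {sqnr_db(signal, q):.1f} dB")
```

Sweeping the fractional bit width this way shows the usual trade-off: narrower words reduce hardware cost, while the SQNR indicates how much accuracy is sacrificed, which is the basis on which the bit widths in the hardware specification are chosen.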

