
A Multi-precision Neural Network Inference Acceleration System Based on FPGA

Advisor: 闕志達

Abstract


In recent years, neural networks have been applied in many fields, such as image recognition and natural language processing, and have achieved impressive results. These advances have prompted the advent of neural-network-based products and highlighted the importance of performing neural network inference on edge devices. Therefore, how to run neural network inference faster and at lower power while maintaining accuracy has become a major research focus. Building on the widely used linear quantization technique, this thesis introduces a lower-complexity number format, integrates algorithms proposed in related research, and presents a complete mixed-precision neural network quantization algorithm. Across various applications, datasets, and model architectures, our algorithm achieves accuracy comparable to that of the FP32 models. In addition, we design a corresponding inference acceleration circuit based on the proposed algorithm and implement it on an FPGA. Our circuit not only supports three weight formats but, through the im2col algorithm, can also compute convolutional neural networks under various parameter settings, improving the flexibility of the hardware. Finally, we integrate the algorithm and the hardware control flow with PyTorch, creating a complete solution that covers everything from quantized neural network training to FPGA-accelerated deployment. We have verified this system on multiple neural network architectures. On average, the throughput of our FPGA acceleration system is 2.96 times that of a CPU, and its energy efficiency is 11.43 times that of a CPU.
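The linear quantization the abstract builds on can be illustrated with a minimal sketch. The per-tensor scale/zero-point scheme below is the standard asymmetric formulation, shown for illustration only; it is not the thesis's exact mixed-precision algorithm or number format:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Per-tensor asymmetric linear quantization to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)      # real-value step per integer step
    zero_point = int(round(qmin - x.min() / scale))  # integer that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate real values."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s, z = quantize(x)
x_hat = dequantize(q, s, z)  # close to x, within one quantization step
```

Integer matrix multiplies then operate on `q` directly, and `scale`/`zero_point` are folded back in afterwards, which is what makes low-precision inference cheap in hardware.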
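The im2col transformation that gives the accelerator its flexibility can likewise be sketched. This is a generic NumPy version (stride 1, no padding) to show the idea of turning convolution into one matrix multiplication; the circuit's actual implementation is not reproduced here:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unfold a (C, H, W) input into a matrix whose columns are the
    flattened kh x kw patches, so convolution becomes a single GEMM."""
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            patch = x[:, i:i + kh, j:j + kw]     # one receptive field
            cols[:, i * out_w + j] = patch.reshape(-1)
    return cols

# Convolution as GEMM: weights reshaped to (out_channels, C*kh*kw)
x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
w = np.random.randn(3, 2 * 3 * 3).astype(np.float32)
y = w @ im2col(x, 3, 3)  # shape (3, 4): 3 output channels over a 2x2 output map
```

Because any kernel size, channel count, or output size just changes the GEMM dimensions, one matrix-multiply engine can serve convolutions with many different parameter settings.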

