Reconfigurable Accelerator and Software-Hardware Co-design for Circulant-Matrix-Based Neural Networks

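Artificial intelligence is advancing rapidly, and embedded AI applications are becoming the prevailing trend. Deep learning delivers excellent accuracy for inference and training on embedded systems, but its heavy computation and large parameter count are a challenge for such systems. How to perform fast and accurate inference with limited processor, memory, and area is an active research direction. In this paper, we replace the weight matrix of the fully connected layer with a circulant matrix. Fully connected layers account for more than 90% of the parameters in deep learning architectures, and this method greatly reduces the parameter count at the cost of a small drop in accuracy, lowering the space complexity from O(n^2) to O(n) and yielding significant memory savings. We implement hardware acceleration for the circulant fully connected layer on an SoC FPGA and pair it with software to realize image recognition based on a convolutional neural network. Experiments show that on two common datasets, the parameter count is reduced by 99% while recognition accuracy drops by at most 0.8%, and the designed hardware accelerator achieves a 256-fold speedup over the development platform's dual-core ARM Cortex™-A9 processor.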

Keywords

Neural network; FPGA; circulant matrix; software-hardware co-design

Parallel Abstract

Deep learning requires heavy computation and a large number of parameters, both of which are challenges for embedded systems. How to perform fast and accurate inference with limited resources is an active research direction. In this paper, we replace the weight matrix in the fully connected layer with a circulant matrix, which reduces the space complexity from O(n^2) to O(n). We design a hardware accelerator for the circulant fully connected layer and use software-hardware co-design to realize image recognition based on a convolutional neural network. Experiments show that on two standard datasets, the parameter count is reduced by 99% while inference accuracy drops by at most 0.8%, and the designed hardware accelerator is 256 times faster than the dual-core ARM Cortex™-A9 processor of the development platform.
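
The reduction from O(n^2) to O(n) storage can be illustrated with a minimal sketch. It assumes a square layer whose weight matrix C is defined by a single vector c through C[i][j] = c[(i - j) mod n]. The function names are hypothetical, and the abstract does not say how the accelerator itself computes the product, so the direct summation below only illustrates the storage saving and may differ from the hardware datapath.

```python
def circulant_matvec(c, x):
    """Compute y = C x, where C[i][j] = c[(i - j) % n].

    Only the n entries of c are stored; the n x n matrix is never built.
    """
    n = len(c)
    if len(x) != n:
        raise ValueError("c and x must have the same length")
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def dense_circulant(c):
    """Materialize the full n x n circulant matrix (for reference only)."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]


def dense_matvec(W, x):
    """Ordinary matrix-vector product with an explicitly stored matrix."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


if __name__ == "__main__":
    c = [1, 2, 3, 4]   # n stored parameters
    x = [5, 6, 7, 8]   # layer input
    # Both products agree, but the first never stores the 16-entry matrix.
    print(circulant_matvec(c, x) == dense_matvec(dense_circulant(c), x))
```

For n = 4 the layer stores 4 values instead of 16. The direct loop still takes O(n^2) multiplications; a circulant product is a circular convolution, so it can also be evaluated with FFTs in O(n log n) time.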