
Design of a Convolutional Neural Network Training Acceleration System Based on an FPGA Circuit for Floating-Point Signed-Digit Arithmetic

A Convolutional Neural Network Training Acceleration Solution Based on FPGA Implementation of FloatSD8 Convolution

Advisor: Tzi-Dar Chiueh (闕志達)

Abstract


In recent years, advances in computing power have allowed convolutional neural networks (CNNs) to solve image-processing problems far more complex than those handled by traditional computer vision algorithms. Their outstanding performance has sparked a wave of research, and after CNN accuracy on image classification tasks reached or even surpassed human recognition accuracy, research gradually shifted toward completing the training task with lower power consumption and higher efficiency. During CNN training, forward propagation and backward propagation are used repeatedly to adjust the network weights, gradually searching for the lowest point on the loss surface so as to obtain an optimal model. This process requires a large amount of computation, and the FloatSD8 representation adopted in this thesis aims to reduce its computational complexity by training with lower-precision numerical representations while still obtaining a model whose accuracy is close to that of a model trained with conventional single-precision floating-point arithmetic. In the simulation stage, in addition to reducing the weights to 8-bit-wide FloatSD8, the feature map values and gradients in forward and backward propagation are also quantized to narrower bit widths to lower the complexity and raise the overall computational throughput. Furthermore, to reduce the bit width of the accumulation during training from single precision to half-precision floating point, this thesis adopts NVCaffe, a branch of the Caffe platform (originally developed by the Berkeley Artificial Intelligence Research center) maintained by NVIDIA, as the open-source code base to modify and to simulate half-precision accumulation. On three image recognition datasets, MNIST, CIFAR-10, and ImageNet, the proposed scheme achieves training results on MNIST and CIFAR-10 that are similar to or even better than those of single precision; on ImageNet, training ResNet-50 with FloatSD8 and with the other parameters quantized to 8 to 7 bits still reaches a top-5 accuracy of 90.99%, only 0.56% below the single-precision floating-point version. Beyond algorithm simulation, this thesis also designs an accelerator processing element for the FloatSD8 algorithm that supports both forward and backward propagation, and further builds an integrated hardware/software FPGA version of the whole training accelerator. Compared with a single-precision CPU platform, the overall system is 4.7 times faster when training the small LeNet network, and the convolution operations in forward and backward propagation are 6.08 times faster.
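The training procedure summarized above, forward propagation, backward propagation of the output error, and a gradient-descent weight update, can be illustrated with a minimal sketch. The single linear layer, squared-error loss, random data, and learning rate below are illustrative assumptions standing in for the CNN described in the thesis, not its actual implementation.

```python
import numpy as np

# Minimal sketch of the loop described above: forward pass, backward pass,
# and the gradient-descent update W <- W - lr * dL/dW.
# A single linear layer with a squared-error loss stands in for the CNN;
# shapes, data, and the learning rate are illustrative assumptions.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3)) * 0.1      # weights being trained
x = rng.standard_normal((8, 4))            # one mini-batch of inputs
t = rng.standard_normal((8, 3))            # target outputs
lr = 0.01                                  # learning rate

for step in range(100):
    y = x @ W                              # forward propagation
    err = y - t                            # output error
    loss = 0.5 * np.mean(np.sum(err**2, axis=1))
    dW = x.T @ err / x.shape[0]            # backward propagation: dL/dW
    W -= lr * dW                           # weight update (descend the loss surface)
```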

Parallel Abstract


In recent years, due to the advancement of computing power, the image-processing problems solved by convolutional neural networks (CNNs) have become far more complex than those addressed by traditional computer vision algorithms. The outstanding performance of CNNs has sparked a broad research boom. After CNN accuracy on image classification tasks reached or exceeded that of human beings, researchers gradually moved on to seek ways to train CNNs with lower power consumption and higher efficiency. In CNN training, the training data is passed through the CNN in the forward direction, the output errors are passed backward through the network, and the CNN weights are adjusted. By doing so, one can seek a minimum on the loss surface to obtain the optimal CNN model. The FloatSD8 representation used in this thesis aims to reduce the computational complexity required in this process by training with lower-precision numerical representations. In addition, the FloatSD8 training scheme achieves accuracy similar to that of training with single-precision floating-point arithmetic (FP32). In the simulation, in addition to reducing the weights to 8 bits, the remaining variables, such as the feature map values in forward propagation and the gradients in backward propagation, are also quantized to reduce the computational complexity and improve the training throughput. Moreover, the precision of the accumulation in the convolution process is reduced from single-precision (FP32) to half-precision floating point (FP16). We implement the half-precision accumulation by modifying the source code of NVCaffe, a branch of the Caffe platform developed by the Berkeley Artificial Intelligence Research (BAIR) center and maintained by NVIDIA. On three well-known image classification datasets, MNIST, CIFAR-10, and ImageNet, the proposed scheme achieves accuracy on MNIST and CIFAR-10 similar to or even better than that of the single-precision floating-point version; on ImageNet, training ResNet-50 with FloatSD8 and with the other parameters quantized to 8 to 7 bits still yields a top-5 accuracy of 90.99%, only 0.56% lower than the FP32 version. In addition to algorithmic simulation, we also designed a processing element (PE) for the proposed FloatSD8 training scheme. This PE supports both forward and backward propagation. Finally, we built an integrated CNN training acceleration system consisting of FPGA hardware and control software. Compared to a single-precision CPU platform, the overall training of the LeNet CNN on the MNIST database is sped up by 4.7X, and the convolution operations in forward and backward propagation achieve a speedup of 6.08X.
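As a rough illustration of the precision reduction described above, the sketch below quantizes feature-map and gradient values to a reduced bit width and accumulates convolution products in FP16 instead of FP32. It is only a generic NumPy approximation under assumed bit widths; it does not reproduce the actual FloatSD8 signed-digit weight encoding or the NVCaffe source modifications.

```python
import numpy as np

def quantize(x, bits=8):
    """Uniform symmetric quantization to the given bit width (an assumed
    stand-in for the thesis's quantizers, not the FloatSD8 format itself)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax + 1e-12
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def dot_fp16_accumulate(a, w):
    """Dot product whose partial sums are kept in half precision (FP16)."""
    acc = np.float16(0.0)
    for p in a.astype(np.float16) * w.astype(np.float16):
        acc = np.float16(acc + p)          # accumulate in FP16, not FP32
    return acc

rng = np.random.default_rng(1)
act = quantize(rng.standard_normal(64), bits=8)   # quantized feature-map slice
wgt = quantize(rng.standard_normal(64), bits=8)   # quantized weight slice
print(dot_fp16_accumulate(act, wgt), float(np.dot(act, wgt)))
```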

