
低複雜度卷積神經網路訓練與其低功耗運算單元電路設計

Low-complexity Convolution Neural Network Training and Low Power Circuit Design of its Processing Element

Advisor: 闕志達

Abstract


In recent years, advances in computing technology have revived broad research interest in deep neural networks (DNNs) and artificial intelligence. There are several types of neural networks, including the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN); among them, CNNs are widely applied to image processing tasks such as image classification and object detection, as well as to natural language processing and even the game of Go. Recently, CNNs with more than a hundred layers have been built to solve difficult tasks, but at the same time their computational complexity is much higher than that of the traditional MLP. A CNN is trained by repeatedly propagating images forward through the network, propagating the error values backward through the network, and adjusting the network weights, descending on the loss surface until a minimum is found and an optimal model is obtained; with the trained model, inference requires only a single forward pass of the input data through the network. The training phase therefore consumes a large amount of computation. This thesis applies the floating-point signed digit (FloatSD) algorithm to network training and inference to reduce the computational complexity. In addition, we quantize the neuron outputs of each layer and the backward-propagated error values during training and inference to save even more computation. We show that deep CNNs do not need 32-bit floating-point arithmetic during training to achieve comparable results. This thesis uses the Caffe framework developed by the Berkeley Artificial Intelligence Research (BAIR) lab and implements FloatSD and the other quantization algorithms by modifying Caffe's source code. Experiments are conducted on three benchmark image classification datasets: MNIST, CIFAR-10, and ImageNet (ILSVRC). The results show that on small-scale tasks such as MNIST and CIFAR-10, FloatSD training even outperforms floating-point training; even when extended to large-scale image classification such as ImageNet, FloatSD can train from scratch without floating-point pre-trained weights, and for a network with over 90% top-5 accuracy the result is only 0.8% below the floating-point version. Besides software simulation, this thesis also designs the hardware circuit of the FloatSD processing element, which will serve as the computational unit of a general-purpose neural network chip under development. With the FloatSD algorithm, clock gating, and zero-term sorting, the circuit area is 16.6% and the power consumption 0.72% to 10.8% of those of the 32-bit floating-point counterpart.
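As a rough illustration of the signed-digit idea behind FloatSD, the Python sketch below approximates a weight by a small number of signed power-of-two terms, so that multiplication by the quantized weight reduces to a few shifts and adds; the exact FloatSD digit grouping, exponent handling, and weight-update rule of the thesis are not reproduced here, and the function name signed_digit_quantize and the num_terms parameter are illustrative assumptions.

import numpy as np

def signed_digit_quantize(w, num_terms=2):
    """Greedily approximate a weight by a sum of signed power-of-two terms.

    A generic signed-digit sketch, not the exact FloatSD format of the
    thesis: every kept term is +/- 2^k, so multiply-accumulate with the
    quantized weight reduces to a few shifts and adds in hardware.
    """
    approx, residual = 0.0, float(w)
    for _ in range(num_terms):
        if residual == 0.0:
            break
        sign = 1.0 if residual > 0 else -1.0
        exponent = int(np.round(np.log2(abs(residual))))  # nearest power of two in the log domain
        term = sign * 2.0 ** exponent
        approx += term
        residual -= term
    return approx

print(signed_digit_quantize(0.7))   # 0.75    (= 2^-1 + 2^-2)
print(signed_digit_quantize(-0.3))  # -0.3125 (= -2^-2 - 2^-4)

Replacing full-precision multipliers with such shift-and-add terms is the intuition behind the processing-element area and power reductions reported above.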

Parallel Abstract (English)


In recent years, deep neural networks and AI research have attracted much attention. There are several types of neural networks, including the multilayer perceptron (MLP), the convolutional neural network (CNN), and the recurrent neural network (RNN). Among these architectures, CNNs have been widely used in image processing tasks including, but not limited to, image classification and object detection, as well as natural language processing and even the game of Go. Recently, it has been shown that CNNs can be built with hundreds of layers to solve difficult tasks; however, they require much more computing effort than the traditional MLP. A CNN is trained by iteratively passing training data forward through the network, propagating the output error backward through the network, and adjusting the network weights, traversing the loss surface toward a minimum to obtain the best model. With the trained model, one can pass the input data through the network once and obtain the inference result. We apply the floating-point signed digit (FloatSD) algorithm to both the training and inference phases of CNNs to reduce the computational effort. In addition, we quantize the neuron outputs of each layer and the backward delta errors for further computational savings. We show that 32-bit floating-point arithmetic is not necessary during training to obtain comparable results. We implement FloatSD and the quantization algorithms by modifying the source code of Caffe, the well-known deep learning framework developed by the Berkeley Artificial Intelligence Research (BAIR) lab. Three benchmark image classification datasets, MNIST, CIFAR-10, and ImageNet (ILSVRC), are used throughout our experiments. Results show that FloatSD training achieves even better results than floating-point training on the MNIST and CIFAR-10 datasets. Even on the ImageNet dataset, the proposed algorithm can train from scratch and obtain a model with over 90% top-5 accuracy, with only 0.8% degradation of top-5 accuracy relative to the floating-point version. In addition to software simulation, we also design the FloatSD processing element circuit, which will be the computational module of our ongoing general-purpose neural network chip. With the FloatSD algorithm, clock gating, and a zero-sorting technique, the circuit area and power consumption are 16.6% and 0.72% to 10.7%, respectively, of those of the 32-bit floating-point counterpart.
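As a minimal sketch of where the activation and delta-error quantization described above could sit in one training step, the Python fragment below quantizes the forward output and the backward error of a toy fully connected layer before they are used; the uniform 8-bit quantizer, layer sizes, and learning rate are assumptions for illustration and do not reproduce the actual Caffe modifications.

import numpy as np

def quantize(x, num_bits=8):
    """Uniform symmetric quantization; a stand-in for the thesis's
    activation / delta-error quantizers (the bit width is an assumption)."""
    max_abs = np.max(np.abs(x))
    if max_abs == 0.0:
        return x
    scale = max_abs / (2 ** (num_bits - 1) - 1)
    return np.round(x / scale) * scale

# Toy fully connected layer trained for one step with a quantized forward
# activation and a quantized backward error (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 8))   # weights
x = rng.normal(size=(1, 4))              # input activations
target = rng.normal(size=(1, 8))
lr = 0.01

y = quantize(x @ W)                      # forward pass, layer output quantized
delta = quantize(y - target)             # backward error (L2 loss gradient), quantized
grad_W = x.T @ delta                     # weight gradient from quantized tensors
W -= lr * grad_W                         # plain SGD update

In the thesis, the same idea is combined with FloatSD weights, so that the forward, backward, and update computations all operate on reduced-precision operands.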

