In recent years, convolutional neural networks have achieved promising results in many fields such as image recognition, natural language processing, and object detection. However, as accuracy increases, models have also grown larger and more complex; if we want to deploy these models on platforms with limited hardware resources such as edge devices, we must compress these large models or accelerate their inference. Mixed-precision quantization is an effective model compression method: it assigns each layer a bit-width according to its importance, where a larger bit-width preserves more information, and an appropriate per-layer bit-width configuration can further reduce model size and improve accuracy. However, previous search-based mixed-precision quantization methods spend a great deal of time on training, and the features after the activation function are usually either given the same bit-width as the weights or quantized at a single fixed precision, which means the model still has considerable room for optimization. To address these problems, we propose using the magnitude of each layer's gradients during training as an indicator of its quantization sensitivity, and we design a training method based on this sensitivity to train more efficiently. In addition, by exploiting the characteristics of the ReLU activation function, we assign different bit-widths to the activations of different layers to achieve better performance. The information we use is readily available, so the extra burden on the training process is negligible. Experiments on CIFAR-10, Tiny ImageNet, and multiple models demonstrate that our method indeed outperforms previous works.
Recently, convolutional neural networks have shown promising results in image classification, natural language processing, and object detection. As accuracy increases, models become larger and more complex, so it is necessary to compress these huge models or accelerate model inference if we want to deploy them on platforms with limited hardware resources. Mixed-precision quantization is one effective method for compressing models. Compared with fixed-precision quantization, mixed-precision quantization assigns each layer a bit-width according to its importance, so an appropriate per-layer bit-width configuration can compress the model further and improve accuracy. Among previous mixed-precision quantization methods, search-based methods take a great deal of time to train, and many prior works either give activations the same bit-width as the weights or apply fixed-precision quantization to the activations; however, the activations also have an optimal bit-width, which differs from that of the weights. To solve these problems, we show that the gradients can serve as sensitivity indicators for a layer, and we design a training method that determines the weight bit-width of each layer based on this sensitivity. By exploiting the characteristics of the ReLU activation function, the activations of different layers are assigned different bit-widths to achieve better performance. Additionally, the information we use is readily available, adding little extra burden to the training process. Our experiments on the CIFAR-10 and Tiny ImageNet datasets show that our method achieves higher performance than prior works.
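To make the gradient-based sensitivity idea concrete, the following is a minimal sketch in PyTorch. It assumes the sensitivity score is simply the mean absolute gradient of each layer's weights collected after a backward pass; the exact metric, accumulation scheme, and the mapping from sensitivity to bit-width used in this work may differ.

```python
# Minimal sketch, assuming PyTorch and mean absolute weight gradient as the
# per-layer sensitivity score (illustrative choice, not necessarily the exact
# metric used in the thesis).
import torch
import torch.nn as nn

def layer_gradient_sensitivity(model: nn.Module, loss: torch.Tensor) -> dict:
    """Backpropagate once and record a gradient-magnitude score per weight layer."""
    loss.backward()
    sensitivity = {}
    for name, param in model.named_parameters():
        if param.grad is not None and name.endswith("weight"):
            # A larger average gradient magnitude marks the layer as more
            # sensitive, so it would be assigned a larger weight bit-width.
            sensitivity[name] = param.grad.abs().mean().item()
    return sensitivity
```

These scores are cheap to obtain because the gradients are already computed during normal training, which is why the extra burden on the training process stays negligible.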
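The ReLU-related observation can likewise be illustrated with a short sketch: since post-ReLU activations are non-negative, an unsigned uniform quantizer can spend its entire integer range on positive values, and the bit-width can be chosen per layer. The max-based scale below is an assumed, simplified calibration, not necessarily the scheme used in this work.

```python
# Minimal sketch of unsigned uniform quantization for post-ReLU activations,
# assuming a simple per-tensor max-based scale (illustrative only).
import torch

def quantize_relu_activation(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Simulated quantization of a non-negative activation tensor to `bits` bits."""
    qmax = 2 ** bits - 1
    # ReLU outputs are >= 0, so the range [0, qmax] covers all values.
    scale = x.max().clamp(min=1e-8) / qmax
    x_q = torch.clamp(torch.round(x / scale), 0, qmax)
    return x_q * scale  # dequantize back to float for simulated quantization
```

In this setting, passing a different `bits` value per layer is what allows the activations of different layers to receive different bit-widths.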