
A Method of Mixture-Distributed Regularization for Model Compression

Advisor: 吳家麟

Abstract


Efficient deep learning computation is an important topic: it not only reduces computational cost, but also makes it possible to bring artificial intelligence to mobile devices. Regularization is a common approach to model compression, and $L_0$-norm regularization is one of the effective options. Because this norm is defined as the number of non-zero parameters, it is well suited as a sparsity constraint on neural network weights. However, the same definition makes the $L_0$ norm discrete and therefore mathematically intractable. An earlier work used the Concrete distribution to emulate binary gates and used these gates to decide which parameters should be pruned. This thesis proposes a more reliable framework for emulating binary gates: a regularization term built from mixture distributions. Under the proposed framework, any symmetric pair of distributions converging to $\delta(0)$ and $\delta(1)$ can serve as an approximately binary gate, which in turn yields an estimate of the $L_0$ regularizer and achieves the goals of model compression and network reduction. In addition, we derive a reparameterization method for mixture distributions and apply it to this compression setting, so that the proposed deep learning algorithm can be optimized with stochastic gradient descent. Experimental results on the MNIST and CIFAR-10/CIFAR-100 datasets show that the proposed method is highly competitive.
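For concreteness, the earlier Concrete-distribution approach mentioned above can be sketched as a smoothed binary gate attached to each weight (or group of weights). The sketch below is illustrative only and is not this thesis's own formulation; the names (`hard_concrete_gate`, `log_alpha`, `beta`, `gamma`, `zeta`) and constants follow the common hard-concrete style of such gates and are assumptions here.

```python
import math
import torch

def hard_concrete_gate(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Sample a smoothed binary gate z in [0, 1], one per weight (group).

    log_alpha  : learnable location logits, one per gate.
    beta       : temperature of the underlying Concrete distribution.
    gamma, zeta: stretch interval (gamma < 0 < 1 < zeta) so that clamping
                 produces exact 0s and 1s with non-zero probability.
    """
    u = torch.rand_like(log_alpha).clamp(1e-6, 1.0 - 1e-6)    # u ~ Uniform(0, 1)
    s = torch.sigmoid((torch.log(u) - torch.log(1.0 - u) + log_alpha) / beta)
    s_bar = s * (zeta - gamma) + gamma                         # stretch to (gamma, zeta)
    return torch.clamp(s_bar, 0.0, 1.0)                        # hard-clip into [0, 1]

def expected_l0(log_alpha, beta=2.0 / 3.0, gamma=-0.1, zeta=1.1):
    """Differentiable surrogate for the expected number of non-zero gates."""
    return torch.sigmoid(log_alpha - beta * math.log(-gamma / zeta)).sum()
```

During training, each weight is multiplied by its sampled gate, and the surrogate expected $L_0$ count is added to the task loss with a penalty coefficient; gates whose keep-probability collapses to zero can be pruned together with their weights. The framework proposed in this thesis generalizes this construction beyond the Concrete distribution.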

Parallel Abstract


Efficient deep learning computing has recently received considerable attention: it reduces computational cost and makes model inference feasible on on-chip devices. Regularizing the parameters is a common way to compress a model, and $L_0$ regularization is an effective choice because it penalizes non-zero parameters without shrinking larger values. However, the combinatorial nature of the $L_0$ norm makes it intractable to optimize directly. A previous work approximated the $L_0$ norm using the Concrete distribution to emulate binary gates and collectively determined which weights should be pruned. In this thesis, a more general framework for relaxing binary gates through mixture distributions is proposed. Under the proposed framework, any mixture pair of distributions converging to $\delta(0)$ and $\delta(1)$ can be applied to construct smoothed binary gates. We further introduce a reparameterization method for mixture distributions to the field of model compression. Reparameterized smoothed binary gates drawn from mixture distributions support efficient gradient-based optimization under the proposed deep learning algorithm. Extensive experiments show that we achieve the state of the art in terms of pruned architectures, structured sparsity, and the reduced number of floating-point operations (FLOPs).
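The abstract describes the framework only at a high level; the particular mixture pair and the reparameterization used in the thesis are not spelled out here. As a minimal sketch under stated assumptions, suppose the gate is drawn from a two-component Gaussian mixture whose components N(0, σ²) and N(1, σ²) converge to δ(0) and δ(1) as σ → 0, and relax the discrete component choice with a binary Concrete (Gumbel-sigmoid) sample so the draw is differentiable with respect to the mixture logit. The names below (`mixture_gate`, `logit`, `sigma`, `temperature`, `expected_gate_count`) are hypothetical.

```python
import torch

def mixture_gate(logit, sigma=0.05, temperature=0.5):
    """Reparameterized draw from a two-component Gaussian mixture gate.

    The mixture of N(0, sigma^2) and N(1, sigma^2) converges to the point
    masses delta(0) / delta(1) as sigma -> 0, so samples approach a binary
    on/off gate while remaining differentiable in `logit`.
    """
    # Relax the Bernoulli component choice with a binary Concrete sample.
    u = torch.rand_like(logit).clamp(1e-6, 1.0 - 1e-6)
    noise = torch.log(u) - torch.log(1.0 - u)                  # Logistic(0, 1) noise
    keep = torch.sigmoid((logit + noise) / temperature)        # soft indicator in (0, 1)

    # Gaussian reparameterization around the (soft) component mean.
    eps = torch.randn_like(logit)
    mean = keep * 1.0 + (1.0 - keep) * 0.0                     # interpolate component means
    return torch.clamp(mean + sigma * eps, 0.0, 1.0)

def expected_gate_count(logit):
    """Expected number of active gates: sum of 'keep' probabilities."""
    return torch.sigmoid(logit).sum()

# Usage sketch: multiply each weight (or channel) by its gate in the forward
# pass and penalize the expected gate count, e.g.
#   loss = task_loss + l0_weight * expected_gate_count(logit)
# then optimize the weights and the gate logits jointly with SGD/Adam.
```

Because both the component choice and the within-component noise are expressed as deterministic transforms of the learnable logit and parameter-free noise, gradients flow through the sampled gates, which is what allows the whole model to be trained with stochastic gradient descent.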

