Convolutional neural networks (CNNs) are a state-of-the-art technique in machine learning and have achieved high accuracy in many computer vision tasks. However, the number of parameters in these models is growing rapidly in pursuit of higher accuracy, which demands more computation time and memory for training and inference. Compressing the model and improving the inference speed have therefore become important problems. This paper focuses on filter pruning and the NVIDIA sparse tensor core. Filter pruning is a model compression method that evaluates the importance of the filters in a CNN model and removes the less important ones. The NVIDIA sparse tensor core is hardware support provided by the NVIDIA Ampere GPU architecture; it can speed up matrix multiplication when the matrix follows a 2:4 sparsity pattern, i.e., at most two nonzero values in every group of four consecutive elements. In this paper, we propose a hybrid pruning metric for pruning CNN models. Hybrid pruning combines filter pruning and 2:4 pruning. We first apply filter pruning to remove redundant filters from the convolutional layers, making the model smaller. We then apply 2:4 pruning to enforce the 2:4 pattern so that the sparse tensor core hardware can accelerate inference. For this hybrid setting, we also propose a hybrid ranking metric that decides each filter's importance during filter pruning: it preserves the filters that are important to both pruning steps. By considering both criteria, we achieve higher accuracy than traditional filter pruning. We test our hybrid pruning algorithm with AlexNet on the MNIST, SVHN, and CIFAR-10 datasets. Our experiments show that the hybrid ranking method achieves better accuracy than the classic L1-norm metric and the output L1-norm metric: when we prune away 40% of the filters in the model, our method achieves 2.8%, 2.9%, and 2.7% higher accuracy than these baselines on the three datasets, respectively. We also evaluate inference speed, comparing the hybrid pruning model with models produced by filter pruning alone and by 2:4 pruning alone. The hybrid pruning model is 1.3x faster than the filter pruning model at the same accuracy, and 1.36x faster than the 2:4 pruning model with only a 1.6% accuracy loss.
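To make the classic L1-norm baseline concrete, the following is a minimal NumPy sketch of L1-norm filter ranking. The function names (`l1_filter_scores`, `filters_to_keep`) are illustrative, not from the paper.

```python
import numpy as np

def l1_filter_scores(weight):
    """Score each filter by the L1 norm of its weights.

    `weight` has shape (out_channels, in_channels, kh, kw), the usual
    layout of a convolutional layer's weight tensor.
    """
    return np.abs(weight).sum(axis=(1, 2, 3))

def filters_to_keep(weight, prune_ratio=0.4):
    """Return the sorted indices of the filters kept after removing the
    `prune_ratio` fraction with the smallest L1 norms."""
    scores = l1_filter_scores(weight)
    n_keep = int(round(weight.shape[0] * (1.0 - prune_ratio)))
    keep = np.argsort(scores)[::-1][:n_keep]  # largest norms first
    return np.sort(keep)
```

For a layer with weight shape (64, 3, 3, 3), `filters_to_keep(weight, 0.4)` keeps the 38 filters with the largest L1 norms, matching the 40% pruning ratio used in the experiments.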
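The 2:4 pattern itself is easy to state in code. The sketch below applies magnitude-based 2:4 pruning to a weight matrix; production flows typically rely on NVIDIA's own tooling, so this NumPy version is for illustration only.

```python
import numpy as np

def prune_2_4(matrix):
    """Enforce 2:4 structured sparsity: in every group of four
    consecutive elements, zero the two smallest magnitudes.

    Assumes the total number of elements is a multiple of four, as the
    2:4 pattern accelerated by sparse tensor cores requires.
    """
    flat = matrix.reshape(-1, 4)
    # Per group of four, indices of the two smallest-magnitude entries.
    drop = np.argsort(np.abs(flat), axis=1)[:, :2]
    pruned = flat.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned.reshape(matrix.shape)
```

Dropping the two smallest magnitudes per group is the standard heuristic for 2:4 pruning; the accuracy lost by zeroing those weights is then recovered with fine-tuning.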
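The abstract does not spell out the hybrid ranking formula, so the sketch below is one plausible reading rather than the paper's definition: it scores a filter by mixing its dense L1 norm with the L1 norm of the weights that survive 2:4 pruning, so filters important to both pruning steps rank highest. The mixing weight `alpha` is an assumed parameter, and `prune_2_4` is the sketch defined above.

```python
import numpy as np

def hybrid_filter_scores(weight, alpha=0.5):
    """Hypothetical hybrid ranking (an assumption, not the paper's
    exact formula): mix each filter's dense L1 norm with the L1 norm
    of the weights remaining after 2:4 pruning.

    Assumes each filter flattens to a multiple of four weights, so
    `prune_2_4` groups never span filter boundaries.
    """
    n_filters = weight.shape[0]
    flat = weight.reshape(n_filters, -1)
    dense = np.abs(flat).sum(axis=1)
    surviving = np.abs(prune_2_4(flat)).sum(axis=1)
    # Normalize both terms so neither dominates purely by scale.
    dense = dense / dense.sum()
    surviving = surviving / surviving.sum()
    return alpha * dense + (1.0 - alpha) * surviving
```

Filters whose large weights are concentrated in a few positions keep most of their norm under 2:4 pruning and thus score well on both terms, which matches the stated goal of preserving filters that matter to both pruning steps.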