
Distilling One-Stage Object Detection Network via Decoupling Foreground and Background Features

Advisor: 顏淑惠

Abstract


Knowledge distillation is a popular approach to compressing convolutional networks. Its core idea is to train a lightweight network under the guidance of a large, already-trained network with solid performance, using the large network's predictions, feature maps, and other information to direct the lightweight network's learning. Under the same computational requirements, a lightweight network obtained this way performs better than one trained in the ordinary manner. The guiding large network is commonly called the teacher network, and the guided lightweight network the student network. Most knowledge distillation papers use the KL divergence to measure how much the class probability distributions predicted by the teacher and the student differ; when the difference is small, the student is considered to be very similar to the teacher and distillation is deemed successful. However, we found cases in which the two predicted distributions differ very little while the directions of the corresponding feature vectors differ greatly. In this thesis, we propose using cosine similarity to align the directions of the foreground features of the two networks, enforcing their consistency from another perspective, and we add adaptive learning to the distillation process: by analyzing whether the teacher network's prediction is better than the student network's, we decide whether the student should accept the teacher's guidance.
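This observation is easy to verify numerically. Below is a minimal sketch (the vectors are made-up illustrations, not values from the thesis): two logit vectors that point in clearly different directions can still produce almost identical distributions after Softmax, so the KL divergence stays tiny while the cosine similarity reveals the directional gap.

```python
# Illustrative example only: different directions, nearly identical Softmax outputs.
import torch
import torch.nn.functional as F

teacher_logits = torch.tensor([6.0, 0.0, 0.0])    # hypothetical teacher vector
student_logits = torch.tensor([6.0, -6.0, -6.0])  # hypothetical student vector

# KL(teacher || student): F.kl_div takes log-probabilities as input and
# probabilities as target.
kl = F.kl_div(student_logits.log_softmax(dim=0),
              teacher_logits.softmax(dim=0),
              reduction="sum")

# Cosine similarity of the raw vectors measures how well their directions agree.
cos = F.cosine_similarity(teacher_logits, student_logits, dim=0)

print(f"KL divergence   : {kl.item():.4f}")   # ~0.02 -> distributions look alike
print(f"Cosine similarity: {cos.item():.3f}")  # ~0.58 -> directions ~55 degrees apart
```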

Abstract (English)


Knowledge distillation is a popular method for compressing convolutional networks. The main idea is to use a well-performing, large-scale trained network to guide a small-scale network during training. By transferring knowledge such as the feature maps and prediction results of the large-scale network, the small-scale network can learn better. Therefore, under the same computational demands, a small-scale network trained with knowledge distillation performs better than one trained without it. The large-scale network and the small-scale network are usually called the teacher network and the student network, respectively. In recent work on knowledge distillation for classification, the KL divergence between the predicted probabilities of the teacher and the student is commonly used to measure the difference between their predictions, so the KL divergence loss can serve as a training guide for the student network. The smaller it is, the better, since a small value implies that the student network behaves similarly to the teacher network. However, we found that two feature vectors can point in very different directions and yet, after Softmax, yield a very small KL divergence loss. In this paper, we propose a cosine-similarity loss that encourages similar directions of the foreground feature vectors, together with a KL divergence loss that constrains the teacher and student models to produce similar predictions on the background. We also propose an adaptive learning strategy in which the student learns from the teacher only when the teacher performs better than the student.
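As a rough illustration of how these pieces could fit together, the sketch below combines a cosine-similarity loss on foreground feature locations, a KL divergence loss on background predictions, and an adaptive gate that keeps the distillation term only when the teacher outperforms the student. All names, tensor shapes, the mask construction, and the quality score used for gating are assumptions for illustration, not the thesis' actual implementation.

```python
# Rough sketch of a decoupled distillation loss (assumed shapes, mask
# construction, and gating criterion; not the thesis' exact code).
import torch
import torch.nn.functional as F

def decoupled_distill_loss(student_feat, teacher_feat,
                           student_logits, teacher_logits, fg_mask):
    """student_feat / teacher_feat : (N, C, H, W) backbone feature maps
    student_logits / teacher_logits: (N, K, H, W) per-location class scores
    fg_mask: (N, 1, H, W) binary mask, 1 inside ground-truth boxes (assumed)."""
    bg_mask = 1.0 - fg_mask

    # Foreground: pull the student's feature *direction* toward the teacher's
    # with a (1 - cosine similarity) loss at each foreground location.
    cos = F.cosine_similarity(student_feat, teacher_feat, dim=1)      # (N, H, W)
    fg_loss = ((1.0 - cos) * fg_mask.squeeze(1)).sum() / fg_mask.sum().clamp(min=1.0)

    # Background: match the predicted class distributions with KL divergence.
    kl = F.kl_div(F.log_softmax(student_logits, dim=1),
                  F.softmax(teacher_logits, dim=1),
                  reduction="none").sum(dim=1)                        # (N, H, W)
    bg_loss = (kl * bg_mask.squeeze(1)).sum() / bg_mask.sum().clamp(min=1.0)

    return fg_loss + bg_loss

def adaptive_gate(distill_loss, teacher_score, student_score):
    """Adaptive learning (assumed form): keep the distillation term only when
    the teacher's per-image quality score (e.g. mean IoU against the labels)
    beats the student's."""
    gate = (teacher_score > student_score).float()
    return gate * distill_loss
```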

