
結合場景及物件資訊之深度卷積神經網路物件偵測

Deep Convolutional Neural Network with Scene-centric and Object-centric Information for Object Detection

Advisor: 傅立成
Co-advisor: 蕭培墉 (Pei-Yung Hsiao)

Abstract


In recent years, deep convolutional neural networks (CNNs) have shown remarkable results in image-based object detection. Their ability to learn directly from training data lets deep CNN architectures extract features better than traditional hand-crafted features. However, the computational resources that deep networks require make the technique difficult to apply in scenarios that demand real-time processing, such as Advanced Driver Assistance Systems (ADAS). The computational load can be reduced by adopting faster detection frameworks such as the Single Shot Detector (SSD) and YOLO; however, these methods typically downscale the input image to cut overall computation, which weakens the features of small objects and lowers detection accuracy on them.

This thesis analyzes CNNs trained on an object-centric dataset and on a scene-centric dataset, and finds that the two respond differently to objects of different sizes: a CNN trained on a scene-centric dataset shows stronger feature extraction ability on small objects. We therefore propose an object detection framework that integrates the feature extraction abilities of both networks, training with image-level labels to raise overall detection accuracy and to compensate for the insufficient features of small objects.

To validate the proposed method, we run experiments on the currently most challenging MSCOCO dataset and on the PASCAL dataset. The results show that the proposed method improves overall detection performance and yields a large gain in the detection rate of small objects. In addition, we evaluate the method on the KITTI dataset of on-road driving scenes and on road scenes captured by ourselves, to verify its generality across different target domains.
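The idea of combining the two feature extractors can be pictured as a two-backbone network whose feature maps are fused before a detection head. The PyTorch sketch below is only an illustration of that idea; DualStreamFusion, the channel counts, and the choice of concatenation followed by a 1x1 convolution are assumptions made for the example, not the architecture actually used in this thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamFusion(nn.Module):
    # Hypothetical sketch: fuse feature maps from an object-centric backbone
    # (e.g. ImageNet-pretrained) and a scene-centric backbone (e.g. Places-pretrained)
    # before an SSD/YOLO-style detection head.
    def __init__(self, object_backbone, scene_backbone, obj_ch, scene_ch, out_ch=256):
        super().__init__()
        self.object_backbone = object_backbone   # trained on object-centric data
        self.scene_backbone = scene_backbone     # trained on scene-centric data
        self.fuse = nn.Conv2d(obj_ch + scene_ch, out_ch, kernel_size=1)  # 1x1 conv merges the two streams

    def forward(self, image):
        f_obj = self.object_backbone(image)      # [N, obj_ch, H, W]
        f_scene = self.scene_backbone(image)     # [N, scene_ch, H', W']
        # Align spatial size in case the two backbones downsample differently.
        if f_scene.shape[-2:] != f_obj.shape[-2:]:
            f_scene = F.interpolate(f_scene, size=f_obj.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([f_obj, f_scene], dim=1)
        return self.fuse(fused)                  # passed on to the detection head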

Abstract (English)


In recent years, deep convolutional neural networks (CNNs) have shown impressive performance in computer vision. The ability to learn feature representations from large training datasets lets deep CNNs outperform traditional approaches based on hand-crafted features in object classification and detection. However, inference with deep CNN models is time consuming because of their high complexity, which makes them hard to deploy in real-world applications such as Advanced Driver Assistance Systems (ADAS). To reduce the computational complexity, several fast object detection frameworks have been proposed in the literature, such as SSD and YOLO. Although these methods run in real time, they usually struggle with small objects because they shrink the input image to a smaller size.

In this thesis, we analyze CNNs trained on an object-centric dataset and on a scene-centric dataset, and find that the scene-centric CNN has better localization ability on small objects. Based on this observation, we propose a novel object detection framework that combines the feature representations learned from object-centric and scene-centric datasets, with the aim of improving detection accuracy, especially on small objects.

To validate the proposed method, we evaluate our model on the MSCOCO dataset, currently the most challenging object detection benchmark. The experimental results show that our method improves accuracy on small objects, which leads to better overall results. We also evaluate our method on the PASCAL VOC 2012 and KITTI on-road datasets; the results show that our method achieves state-of-the-art accuracy on both datasets while, most importantly, running at real-time speed.
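On MSCOCO, the claim that the improvement concentrates on small objects is usually checked with the size-stratified COCO metrics (AP on small, medium, and large objects). Below is a minimal evaluation sketch using pycocotools; the annotation and detection file paths are placeholders, not files from this thesis.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO ground-truth annotations and detection results
# in the standard COCO JSON result format.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()

# stats[0] is AP averaged over IoU 0.50:0.95; stats[3], stats[4], stats[5]
# are AP on small, medium, and large objects (COCO area ranges).
ap, ap_s, ap_m, ap_l = coco_eval.stats[0], coco_eval.stats[3], coco_eval.stats[4], coco_eval.stats[5]
print(f"AP={ap:.3f}  AP_small={ap_s:.3f}  AP_medium={ap_m:.3f}  AP_large={ap_l:.3f}")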

