
應用於三維道路物件偵測之體素與像素融合網路

Voxel-Pixel Fusion Network for 3D On-road Object Detection

Advisor: 傅立成 (Li-Chen Fu)
Co-advisor: 蕭培墉 (Pei-Yung Hsiao)

Abstract


Traditionally, the automotive industry has been hardware-oriented. In recent years, the explosive growth of deep learning has pushed computer vision into a new realm, so autonomous driving is no longer an unattainable dream. The first step in autonomous driving is perceiving the environment: information provided by sensors such as LiDAR and cameras is used to detect on-road objects. However, each sensor has its own strengths and weaknesses, so multi-sensor fusion is an effective way to improve detection. This thesis presents a deep learning approach that integrates features from the LiDAR point cloud and the camera RGB image to detect 3D on-road objects. Specifically, it proposes a novel fusion-based 3D object detection network called the Voxel-Pixel Fusion Network (VPFNet). The voxel-pixel fusion layer at its core consists of three modules, parameter feature generation, parameter-based weighting, and voxel-pixel fusion, and bidirectionally fuses the features of each voxel-pixel pair according to their geometric relation. In addition, voxel-pixel pair parameters that capture the characteristics of each pair are proposed to strengthen the fusion. The model is trained on the well-known KITTI dataset and evaluated on the official test set, whose labels are not publicly released. The experimental results show that the method reaches 65.99% mean average precision (mAP) on multi-class 3D object detection across difficulty levels. Notably, our approach ranks first on the KITTI leaderboard for the challenging pedestrian class.
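
The abstract describes the voxel-pixel fusion layer only at a high level. Below is a minimal sketch, in PyTorch-style Python, of how such a bidirectional, parameter-gated fusion step could be organized; the class name VoxelPixelFusionLayer, the feature dimensions, and the sigmoid gating are illustrative assumptions, not the thesis's actual implementation.

import torch
import torch.nn as nn

class VoxelPixelFusionLayer(nn.Module):
    """Sketch of a bidirectional voxel-pixel fusion layer.

    Each voxel is assumed to be paired with a pixel (e.g., by projecting the
    voxel center onto the image plane), and a per-pair parameter vector is
    assumed to describe their geometric relation. All dimensions and module
    names are illustrative, not the thesis's actual design.
    """

    def __init__(self, voxel_dim=64, pixel_dim=64, param_dim=8):
        super().__init__()
        # Parameter feature generation: embed the raw pair parameters.
        self.param_mlp = nn.Sequential(
            nn.Linear(param_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
        )
        # Parameter-based weighting: one gate per fusion direction.
        self.gate_v = nn.Linear(32, voxel_dim)   # pixel -> voxel gate
        self.gate_p = nn.Linear(32, pixel_dim)   # voxel -> pixel gate
        # Voxel-pixel fusion: map the joint feature back to each branch.
        self.fuse_v = nn.Linear(voxel_dim + pixel_dim, voxel_dim)
        self.fuse_p = nn.Linear(voxel_dim + pixel_dim, pixel_dim)

    def forward(self, voxel_feat, pixel_feat, pair_params):
        # voxel_feat:  (N, voxel_dim)  features of N paired voxels
        # pixel_feat:  (N, pixel_dim)  features of the matched pixels
        # pair_params: (N, param_dim)  geometric relation of each pair
        param_feat = self.param_mlp(pair_params)
        w_v = torch.sigmoid(self.gate_v(param_feat))  # how much image info a voxel takes
        w_p = torch.sigmoid(self.gate_p(param_feat))  # how much LiDAR info a pixel takes
        joint = torch.cat([voxel_feat, pixel_feat], dim=-1)
        # Bidirectional fusion: each branch keeps its own feature and adds
        # a gated contribution computed from the joint representation.
        voxel_out = voxel_feat + w_v * self.fuse_v(joint)
        pixel_out = pixel_feat + w_p * self.fuse_p(joint)
        return voxel_out, pixel_out

# Example: fuse features of 100 hypothetical voxel-pixel pairs.
layer = VoxelPixelFusionLayer()
voxel_out, pixel_out = layer(torch.randn(100, 64),
                             torch.randn(100, 64),
                             torch.randn(100, 8))

In this sketch the pair parameters decide, per feature channel, how strongly image information flows into the voxel branch and how strongly LiDAR information flows into the pixel branch, which is one simple way to realize the parameter-based weighting described above.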

