

Enhancement of Monocular Camera Ego-Motion Estimation via Optical Flow for Visual Odometry

Advisor: 李濬屹

Abstract


In this thesis, we propose a learning-based framework, called the decoupled ego-motion estimation methodology for monocular visual odometry and abbreviated as DCVO. The primary objective of DCVO is to enhance the prediction accuracy of visual odometry (VO) through an architecture consisting of four modules: a flow estimation module, a depth estimation module, a flow-depth fusion module, and a pose estimation module. By fusing the RGB input frames with the predicted flow and depth maps in different manners, DCVO is able to learn different features that contribute to the estimation of camera motion. We explore various strategies for fusing these features and compare them from both qualitative and quantitative perspectives. To identify the impact of each feature on VO prediction accuracy, we examine four training schemes: one supervised scheme, and three additional schemes that concurrently apply supervised and unsupervised loss terms to the DCVO framework. The supervised scheme relies on comparison against ground-truth data, while the latter three schemes incorporate auxiliary loss terms on top of it. We perform extensive experiments on the KITTI Odometry dataset and compare DCVO against a number of representative baselines. Our results on the training and testing sequences of the KITTI dataset reveal that the fusion configuration concentrating on flow maps and RGB frames achieves the lowest error rates among the baselines considered. To further improve DCVO, we conduct ablation analyses over its components, covering both the architecture and the training schemes. The analyses show that using delta depth, i.e., the difference between consecutive depth maps, as the model input yields better results than raw depth, making it a promising feature for future VO research.
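The four-module pipeline described above can be sketched in a few lines of code. The following PyTorch snippet is a minimal illustration only: the encoder layout, the channel-concatenation fusion, and all layer sizes are assumptions made for readability, not the networks used in the thesis.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(c_in, c_out):
        # Two stride-2 convolutions as a stand-in for a real encoder backbone.
        return nn.Sequential(
            nn.Conv2d(c_in, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, c_out, 3, stride=2, padding=1), nn.ReLU())

    class DCVOSketch(nn.Module):
        def __init__(self):
            super().__init__()
            self.flow_net = nn.Sequential(conv_block(6, 64), nn.Conv2d(64, 2, 1))   # (1) flow estimation
            self.depth_net = nn.Sequential(conv_block(3, 64), nn.Conv2d(64, 1, 1))  # (2) depth estimation
            self.fusion = conv_block(6 + 2 + 1, 128)                                # (3) flow-depth fusion
            self.pose_head = nn.Linear(128, 6)                                      # (4) pose estimation

        def forward(self, frame_t, frame_t1):
            pair = torch.cat([frame_t, frame_t1], dim=1)               # stacked RGB frames, (B, 6, H, W)
            size = pair.shape[-2:]
            flow = F.interpolate(self.flow_net(pair), size=size)       # predicted flow map, (B, 2, H, W)
            depth = F.interpolate(self.depth_net(frame_t), size=size)  # predicted depth map, (B, 1, H, W)
            # Fusion by plain channel concatenation; the thesis compares several
            # strategies, and its best one concentrates on flow maps and RGB frames.
            fused = self.fusion(torch.cat([pair, flow, depth], dim=1))
            return self.pose_head(fused.mean(dim=(2, 3)))              # 6-DoF relative camera motion

    x_t, x_t1 = torch.rand(1, 3, 128, 416), torch.rand(1, 3, 128, 416)
    print(DCVOSketch()(x_t, x_t1).shape)  # torch.Size([1, 6])

The delta depth variant highlighted by the ablation would simply replace the depth input with the difference of consecutive depth maps, e.g. self.depth_net(frame_t1) - self.depth_net(frame_t).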
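The mixed training schemes pair a supervised pose loss with auxiliary unsupervised terms. Below is a hedged sketch of one such combination, assuming an L1 loss against ground-truth relative poses plus a flow-based photometric reconstruction term; the warping procedure and the weight lambda_photo are illustrative assumptions, not the exact loss formulation of the thesis.

    import torch
    import torch.nn.functional as F

    def flow_warp(frame_t1, flow):
        # Backward-warp frame t+1 toward frame t using a dense flow field
        # (B, 2, H, W): channel 0 = horizontal shift, channel 1 = vertical shift.
        b, _, h, w = frame_t1.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0)  # (1, 2, H, W)
        coords = grid + flow                                      # follow the flow
        # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
        coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
        coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
        grid_norm = torch.stack([coords_x, coords_y], dim=-1)     # (B, H, W, 2)
        return F.grid_sample(frame_t1, grid_norm, align_corners=True)

    def dcvo_loss(pose_pred, pose_gt, frame_t, frame_t1, flow, lambda_photo=0.1):
        supervised = F.l1_loss(pose_pred, pose_gt)                   # comparison against ground truth
        photometric = F.l1_loss(flow_warp(frame_t1, flow), frame_t)  # auxiliary unsupervised term
        return supervised + lambda_photo * photometric

    # Toy call with random tensors; zero flow makes the warp an identity mapping.
    f_t, f_t1 = torch.rand(1, 3, 8, 8), torch.rand(1, 3, 8, 8)
    print(dcvo_loss(torch.rand(1, 6), torch.rand(1, 6), f_t, f_t1, torch.zeros(1, 2, 8, 8)))

The three hybrid schemes in the thesis differ in which auxiliary terms they add on top of the supervised objective; a depth-based reconstruction term, for instance, could be attached in the same weighted-sum manner.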

Keywords

visual odometry, optical flow, monocular camera



