
Unsupervised Learning of Monocular Depth, Optical Flow, and Camera Motion Estimation

Joint Unsupervised Learning of Multi-frame Depth, Optical Flow and Ego-motion by Watching Videos

Advisors: 莊仁輝, 陳華總

Abstract


Learning the 3D geometry of a scene, such as scene depth and optical flow, helps robots perceive the environment and avoid obstacles. In recent years, many studies have developed supervised neural network models that learn the 3D geometry in images from large amounts of data. However, these methods rely heavily on the collection of training data and on the correctness of that data to achieve accurate estimation; they can only be trained on existing datasets and cannot accurately estimate scenes outside the training data. In view of this, this thesis focuses on developing unsupervised neural network models that learn to predict optical flow, scene depth, and camera motion from consecutive monocular images alone. We synthesize the target image by inverse warping the source image with the depth estimated by our model, and train the model on the difference between the synthesized and the real target image. Building on this method, we further use all permutations of image pairs within a sequence of three consecutive monocular frames as training signals; experiments show that this effectively improves the accuracy of depth prediction. Moreover, because moving objects in a scene cause occlusions that severely degrade prediction accuracy, we use optical flow to predict the occluded regions, which aids the training of our depth model and substantially mitigates the occlusion problem. Trained on the KITTI dataset, our monocular depth prediction outperforms other unsupervised depth learning methods. In addition, we collect drone-view images with AirSim to train the proposed model and show that it also achieves good depth and optical flow prediction from drone viewpoints.
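
The view-synthesis supervision described above can be illustrated with a short PyTorch sketch. This is a minimal illustration, not the thesis's exact implementation: the function names, the (B, 3, 4) [R|t] pose parameterization, and the plain L1 photometric penalty are assumptions made for the example.

import torch
import torch.nn.functional as F

def inverse_warp(src_img, tgt_depth, pose_mat, K):
    # src_img:   (B, 3, H, W) source frame
    # tgt_depth: (B, 1, H, W) predicted depth of the target frame
    # pose_mat:  (B, 3, 4) assumed relative pose [R|t], target -> source
    # K:         (B, 3, 3) camera intrinsics
    B, _, H, W = src_img.shape
    device = src_img.device

    # Homogeneous pixel grid of the target frame, shape (B, 3, H*W)
    ys, xs = torch.meshgrid(torch.arange(H, device=device),
                            torch.arange(W, device=device), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()
    pix = pix.view(1, 3, -1).expand(B, -1, -1)

    # Back-project target pixels into 3D using the predicted depth
    cam = (torch.linalg.inv(K) @ pix) * tgt_depth.view(B, 1, -1)

    # Move the 3D points into the source camera and project them
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    proj = K @ (pose_mat @ cam_h)
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Bilinearly sample the source image at the projected locations
    u = 2.0 * uv[:, 0] / (W - 1) - 1.0
    v = 2.0 * uv[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).view(B, H, W, 2)
    return F.grid_sample(src_img, grid, align_corners=True)

def photometric_loss(tgt_img, synth_img):
    # L1 difference between the real and synthesized target views
    return (tgt_img - synth_img).abs().mean()

In a three-frame sequence, each ordered (source, target) pair yields one such loss term, which is how the pair-permutation idea above enlarges the training signal without extra data.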

Parallel Abstract


Learning the 3D geometry of a scene, such as depth and optical flow, can help robots perceive the environment and avoid obstacles. In recent years, many researchers have developed deep neural networks that learn such 3D geometry through supervised learning. However, to achieve high performance, those methods require plenty of well-labelled training data, which is a major limitation of supervised learning since the resulting models may not generalize to scenes outside the dataset. Consequently, we focus on developing an unsupervised learning system that trains deep neural networks to estimate optical flow, depth, and ego-motion with only single-view image sequences as input. Specifically, we exploit the inverse warping technique to synthesize the target image from the predicted depth map and the source image, and use the difference between the true target image and the synthesized one to guide training. Based on this idea, we further use all permutations of image pairs in a three-frame image sequence to train our model. In addition, we introduce soft occlusion maps estimated from optical flow into our networks to tackle the occlusion problem in the estimation of optical flow, depth, and camera ego-motion. Experimental results show that our approach surpasses previous unsupervised works in monocular depth prediction on the KITTI dataset. Also, to verify the generalizability of our model, we train it on a drone-view dataset collected with AirSim and demonstrate that it performs reasonably well across various camera poses and altitudes.
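
The soft occlusion maps can be sketched with a forward-backward flow consistency check, a common heuristic for detecting occlusions; the exponential weighting and the alpha parameter below are assumptions for this sketch rather than the exact formulation used in the thesis.

import torch
import torch.nn.functional as F

def flow_warp(x, flow):
    # Warp tensor x (B, C, H, W) by a pixel-offset flow field (B, 2, H, W)
    B, _, H, W = x.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=x.device),
                            torch.arange(W, device=x.device), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    tgt = base + flow
    u = 2.0 * tgt[:, 0] / (W - 1) - 1.0
    v = 2.0 * tgt[:, 1] / (H - 1) - 1.0
    return F.grid_sample(x, torch.stack([u, v], dim=-1), align_corners=True)

def soft_occlusion_map(flow_fwd, flow_bwd, alpha=0.1):
    # The forward flow and the backward flow warped into the same frame
    # should cancel where both views see the pixel; large residuals
    # indicate occlusion, so those pixels receive weights near zero.
    flow_bwd_warped = flow_warp(flow_bwd, flow_fwd)
    err = (flow_fwd + flow_bwd_warped).pow(2).sum(dim=1, keepdim=True)
    return torch.exp(-alpha * err)  # (B, 1, H, W) weights in (0, 1]

Multiplying such a map into the photometric loss keeps occluded pixels from penalizing an otherwise correct depth or flow estimate.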

