

First-Person View Hand Parameter Estimation based on Convolutional Neural Network for Virtual Reality Applications

Advisor: 傅立成

Abstract


Hand pose estimation has long been a popular research topic in computer vision. For human-computer interaction applications such as virtual reality (VR), augmented reality (AR), and mixed reality (MR), a hand pose estimation system that can accurately and robustly estimate hand joint coordinates from images is essential. However, hand pose estimation still faces several limitations. First, existing datasets consist almost entirely of third-person-view images, which makes them difficult to apply in a first-person-view system, where the camera is mounted on the VR headset. Second, most existing methods depend on a pre-cropped hand region, which is impractical in deployment, especially when the system must be implemented across platforms whose programming languages may differ; a single deep learning model avoids this problem. The goal of this thesis is to develop a system that estimates the location and pose parameters of a hand from a single RGB frame, providing users with an interface for manipulating virtual objects. We propose a deep-learning-based network trained end-to-end on a large dataset we generated ourselves. Using the 3D engine Unity, we render hand models with different skin colors, various light sources, and various poses and locations, compositing them over images randomly sampled from the COCO dataset as backgrounds. We then train a convolutional neural network to estimate not only the location of the hand in the image but also the corresponding pose parameters of the hand model, from which the 3D joint coordinates are obtained. In the experiments, we first compare models trained under different configurations to show that the proposed method improves performance, and then compare the proposed method with several state-of-the-art methods to demonstrate its superior performance. We expect the proposed hand parameter estimation system to provide users with a comfortable experience when interacting with the virtual world.
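As a rough illustration of the data-generation step described above, the following Python sketch composites one rendered hand over a randomly chosen COCO background. It assumes the hand is exported from Unity as an RGBA image with a transparent background; the function name, paths, and output size are hypothetical and not taken from the thesis.

```python
import random
from pathlib import Path
from PIL import Image

def composite_hand(render_path: str, coco_dir: str, out_size=(256, 256)) -> Image.Image:
    """Composite a Unity-rendered hand (RGBA, transparent background)
    onto a background image drawn at random from a local COCO folder."""
    backgrounds = list(Path(coco_dir).glob("*.jpg"))
    background = Image.open(random.choice(backgrounds)).convert("RGB").resize(out_size)
    hand = Image.open(render_path).convert("RGBA").resize(out_size)
    # The render's alpha channel acts as the paste mask, so only the
    # hand pixels overwrite the background.
    background.paste(hand, (0, 0), mask=hand)
    return background
```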

Parallel Abstract


Hand pose estimation has been a popular research topic in computer vision. An accurate and robust hand pose estimation system that can estimate the coordinates of hand joints from images is essential for human-computer interaction applications such as virtual reality (VR), augmented reality (AR), and mixed reality (MR). However, hand pose estimation still faces some constraints. First, almost all existing datasets are third-person-view datasets, which are hard to apply in a first-person-view VR system, where the camera is mounted on the VR headset. Second, most existing methods rely on a preprocessing step that crops a bounding box around the hand, which is not ideal in realistic applications, especially when the system must be ported across platforms whose programming languages differ; a single end-to-end model, by contrast, remains portable. The purpose of this thesis is to develop a system that can estimate the locations and pose parameters of hands from a single RGB image frame and reconstruct them, providing a natural interface for users to manipulate objects in a virtual world. We propose a deep-learning-based network and train it end-to-end on a large dataset we generated ourselves. Using the 3D engine Unity, we render hand models with varied skin colors and light sources, place them in various poses and locations, and composite them over background images randomly sampled from the COCO dataset. We then train a convolutional neural network (CNN) to estimate not only the locations of hands in an image but also their corresponding 3D joint coordinates and their left- or right-handed classification. In the experiments, we first compare models trained under different configurations to show that the proposed method improves performance, and then compare it with other state-of-the-art methods to demonstrate that it outperforms them. We expect the proposed hand parameter estimation system to provide a comfortable experience for users interacting with the virtual world.
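To make the network's outputs concrete, here is a minimal PyTorch sketch of a CNN that maps a single RGB frame to a hand location, a pose parameter vector, and a left/right classification, matching the three quantities named above. The backbone choice (ResNet-18), head sizes, and the 26-dimensional pose vector are illustrative assumptions, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn
from torchvision import models

class HandParameterNet(nn.Module):
    """Sketch: shared CNN backbone with three heads, regressing the
    hand's bounding box and pose parameters and classifying handedness."""

    def __init__(self, num_pose_params: int = 26):  # 26 is a placeholder
        super().__init__()
        backbone = models.resnet18(weights=None)
        # Keep everything up to (and including) global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])
        self.loc_head = nn.Linear(512, 4)                # (x, y, w, h)
        self.pose_head = nn.Linear(512, num_pose_params)
        self.handed_head = nn.Linear(512, 2)             # left/right logits

    def forward(self, x: torch.Tensor):
        f = self.features(x).flatten(1)                  # (N, 512)
        return self.loc_head(f), self.pose_head(f), self.handed_head(f)

model = HandParameterNet()
frame = torch.randn(1, 3, 224, 224)                      # one RGB frame
location, pose, handedness = model(frame)                # (1,4), (1,26), (1,2)
```

In such a design the detection and pose heads share features, which is one way a single model can replace a separate hand-cropping stage.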

