  • Thesis

基於點雲及局部關節網路之人體姿態及體型估測

3D Human Pose and Shape Estimation from Point Clouds with Local Joint Network

Advisor: 簡韶逸

Abstract


3D human pose and shape estimation is an advanced topic in computer vision whose ultimate goal is to recover a human body mesh. Whereas traditional methods rely on model fitting, modern approaches use convolutional neural networks to extract deep features and regress the parameters of a human body model. Current state-of-the-art methods take a single RGB image as input; however, because of rasterization in the renderer, the geometric features of 3D space are only implicitly encoded in such data. We argue that this kind of data cannot correctly convey 3D spatial information, and therefore that data carrying explicit 3D information, namely depth images or point clouds, is a better choice. In this thesis, we propose a two-stage method that predicts 3D human pose and shape from a depth image or the corresponding point cloud, for which we design two dedicated modules, both called Local Joint Networks. In the first stage, we predict 3D human joints. Assuming 2D human joint locations are available in advance as initial features, we project them into 3D space to form initial 3D joints, cluster the point cloud around these joints, and feed the clustered points into the first Local Joint Network to obtain the true 3D human joints in camera space. In the second stage, the 3D joints from the previous stage serve as initial features; combined with the point cloud, they are used to predict the human model parameters, which are then refined in detail by another Local Joint Network. With the refined parameters, we can recover the human body mesh. We evaluate our method on synthetic data generated by ourselves, and the results show that our method is effective. The final goal of this thesis is to realize an augmented reality system; to this end, we design a system that combines our models, takes data from a Kinect v2 camera as input, and outputs images with augmented reality effects. The results show that our method is also effective on real data.
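As a rough illustration of the initialization step described above, the following NumPy sketch lifts 2D joint pixels to initial 3D camera-space joints using a pinhole camera model. The function name and the intrinsics `fx, fy, cx, cy` are hypothetical placeholders, not taken from the thesis.

```python
import numpy as np

def backproject_joints(joints_2d, depth, fx, fy, cx, cy):
    """Lift 2D joint pixels to 3D camera coordinates via the pinhole model.

    joints_2d: (J, 2) integer pixel coordinates (u, v)
    depth:     (H, W) depth map in meters
    Returns:   (J, 3) initial 3D joints in camera space
    """
    us, vs = joints_2d[:, 0], joints_2d[:, 1]
    z = depth[vs, us]                   # depth sampled at each joint pixel
    x = (us - cx) * z / fx              # standard pinhole back-projection
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

These back-projected joints are only as accurate as the depth at the sampled pixels, which is why the thesis refines them with the first Local Joint Network.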

Abstract (English)


3D human pose and shape estimation is an advanced problem in the field of computer vision; its goal is to recover human meshes. While traditional methods mostly rely on model fitting, modern approaches exploit convolutional neural networks (CNNs) to extract deep features and regress a parametric representation of the human mesh. State-of-the-art methods use only an RGB image as input. However, because of rasterization, the geometric features of 3D space are only implicitly encoded in this kind of data. We argue that an RGB image cannot correctly convey the information of 3D space. Therefore, we suggest that 3D data, namely depth images or point clouds, is a better choice. In this thesis, we propose a two-stage method to predict human meshes from depth images or point clouds, built around two special modules named Local Joint Networks (LJNs). In the first stage, we predict 3D human joints. We assume that the 2D joints are provided as initial information. We project these initial joints into 3D space using the depth and apply a grouping technique to gather points into clusters around them. The grouped features are sent to the first LJN to predict the true 3D joints in camera coordinates. In the second stage, the 3D joints from the previous stage become the initial features. We regress an initial parametric model from the point cloud and these initial features, and then refine the detailed parameters with another LJN. With the refined parameters, we can recover the human mesh. We evaluate our method on a synthetic dataset generated on our own, and the experiments show that both of our modules are effective. The final goal of our study is to achieve an AR system. To this end, we design a flow that combines our models to produce augmented images from a Kinect v2 camera. The results also show that our models work well even on real images.
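The grouping step mentioned above can be sketched in a few lines of NumPy: gather the nearest cloud points around each initial joint and re-center them so the network sees joint-local geometry. This is a minimal k-nearest-neighbour sketch under assumed names; the thesis does not specify the exact grouping algorithm or neighbourhood size.

```python
import numpy as np

def group_points(points, joints_3d, k=4):
    """Gather the k nearest cloud points around each initial 3D joint.

    points:    (N, 3) point cloud in camera coordinates
    joints_3d: (J, 3) initial 3D joints
    Returns:   (J, k, 3) point groups, centered on their joints
    """
    # (J, N) pairwise distances between joints and cloud points
    d = np.linalg.norm(joints_3d[:, None, :] - points[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours per joint
    groups = points[idx]                    # (J, k, 3) local point groups
    # subtract each joint so every group is expressed in joint-local coordinates
    return groups - joints_3d[:, None, :]
```

Centering each group on its joint is a common normalization in point-based networks, since it makes the local features invariant to where the joint sits in camera space.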

