

Learning Monocular 3D Human Pose Estimation with Skeletal Animation and Motion Transformer

Advisor: Shang-Hong Lai


Abstract


Deep learning has achieved unprecedented accuracy for monocular 3D human pose estimation. However, current learning-based 3D human pose estimation still suffers from two types of problems: 1) poor generalization and 2) projection ambiguity. When a deep network encounters poses outside the training domain, model performance is prone to degrade due to the gap between limited training data and highly variable in-the-wild data. Inspired by skeletal animation, a technique popular in game development and animation production, we propose a simple yet effective method to synthesize new 3D human pose sequences from existing sequences as augmented data, thus bringing strong generalization to the resulting model. We also put forward a new lifting network built upon the transformer encoder, termed Motion Transformer, which utilizes the powerful self-attention mechanism to perform the geometric mapping from 2D to 3D and resolve projection ambiguity. Experimental results on unseen domains demonstrate superior 3D human pose estimation accuracy when our data augmentation method is applied to the proposed Motion Transformer, achieving state-of-the-art generalization accuracy on publicly available datasets such as MPI-INF-3DHP.
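The skeletal-animation idea above, synthesizing new 3D pose sequences by applying existing joint rotations to a different skeleton via forward kinematics, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the 5-joint chain, the local +y bone direction, and all names (`PARENTS`, `forward_kinematics`, `synthesize_sequence`) are hypothetical.

```python
import numpy as np

# Hypothetical kinematic chain: parent index of each joint (-1 = root).
PARENTS = [-1, 0, 1, 2, 3]


def forward_kinematics(rotations, bone_lengths):
    """Compute 3D joint positions from per-joint rotations and bone lengths.

    rotations:    (J, 3, 3) rotation of each joint relative to its parent.
    bone_lengths: (J,) length of the bone connecting each joint to its parent.
    Returns:      (J, 3) joint positions, with the root at the origin.
    """
    J = len(PARENTS)
    positions = np.zeros((J, 3))
    global_rot = [None] * J
    offset = np.array([0.0, 1.0, 0.0])  # bones extend along the local +y axis
    for j in range(J):
        p = PARENTS[j]
        if p < 0:
            global_rot[j] = rotations[j]  # root: global = local, stays at origin
            continue
        # Accumulate the parent's global rotation, then place the joint at
        # the end of its (rotated, scaled) bone offset from the parent.
        global_rot[j] = global_rot[p] @ rotations[j]
        positions[j] = positions[p] + global_rot[p] @ (offset * bone_lengths[j])
    return positions


def synthesize_sequence(rot_seq, bone_lengths):
    """Retarget a rotation sequence (T, J, 3, 3) onto a new skeleton,
    yielding an augmented 3D pose sequence of shape (T, J, 3)."""
    return np.stack([forward_kinematics(r, bone_lengths) for r in rot_seq])
```

Pairing joint rotations taken from one motion sequence with bone lengths from another skeleton produces pose sequences that do not exist in the original data, which is the spirit of the augmentation described in the abstract.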
