
The Design of Hand Semantic Segmentation System for Head-mounted VR Devices using Deep Learning CNN

Advisor: 蔡淳仁

Abstract


This thesis takes the HTC VIVE as its target platform and uses deep-learning neural network techniques to build a hand image segmentation system that is feasible for practical applications. The completed system not only segments the user's hand region accurately from dynamic video, but also supports real-time human-computer interaction at a processing rate of 30 frames per second.

During the research, we first analyze the distribution of hand scales in virtual-reality usage scenarios, and use this analysis to design the training data and the data-augmentation strategy. For the network architecture, we evaluate the generalization ability of several architectures on public datasets together with part of our self-made data. We also experiment with many training strategies to find the training order best suited to our application. Although the resulting segmentation model cannot meet the real-time requirement, it reaches a very good level of accuracy (average segmentation accuracy of 96.93% on grayscale images and 97.47% on color images).

After completing this high-accuracy hand segmentation model, we split its architecture into two parts and accelerate each to reach real-time operation. First, the front half of the network is replaced with a lower-complexity network, with knowledge distillation used to preserve accuracy. The back half is then accelerated with fast convolution, so that the network achieves both good accuracy (96.30% grayscale, 97.30% color) and real-time (30.21 fps) operation. Finally, we also try incorporating the temporal relationships in the data, which further improves accuracy by 0.17%.
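The speed-up of the back half of the network comes from replacing standard convolutions with separable ones. The following plain-Python sketch counts the multiplications in each form to show where the saving comes from; the layer sizes are hypothetical illustrations, not the thesis's actual network configuration:

```python
# Multiply counts: standard vs. depthwise-separable convolution.
# Layer sizes below are made up for illustration only.

def standard_conv_mults(h, w, c_in, c_out, k):
    """Multiplications for a standard k x k convolution over an h x w feature map."""
    return h * w * c_in * c_out * k * k

def separable_conv_mults(h, w, c_in, c_out, k):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k       # one k x k filter per input channel
    pointwise = h * w * c_in * c_out       # 1 x 1 conv mixes channels
    return depthwise + pointwise

h, w, c_in, c_out, k = 32, 32, 128, 128, 3
std = standard_conv_mults(h, w, c_in, c_out, k)
sep = separable_conv_mults(h, w, c_in, c_out, k)
print(f"standard: {std:,} mults, separable: {sep:,} mults, ~{std / sep:.1f}x fewer")
```

The ratio works out to roughly 1 / (1/c_out + 1/k²), so with 3×3 kernels and many channels the separable form needs about 8–9x fewer multiplications, which is the kind of saving that makes real-time inference reachable.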

Abstract (English)


This thesis aims to design a hand semantic segmentation system for virtual reality applications using the HTC VIVE device. The result of the thesis research is a real-time hand segmentation system that segments all the hands in a dynamic scene with high accuracy. At the early stage of our work, we analyze the scenario of the target application by estimating the hand scale distribution. Based on that distribution, we design our own dataset and set the data augmentation strategy. We also survey several neural network architectures based on their generalization ability on our own dataset and some open-source datasets. To find the best fit for our application and data, we conduct experiments on different training strategies. As a result, we achieve good segmentation accuracy (grayscale images 96.93%; RGB images 97.47%) using a complex CNN model that cannot perform inference in real time. To achieve real-time performance, we split the proposed network structure into two parts and accelerate each part with a different method: the complexity of the first part is reduced using knowledge distillation, while the second part is accelerated by separable convolutions. We successfully implement a real-time (30.21 fps) system with good accuracy (grayscale images 96.30%; RGB images 97.30%). Finally, we also incorporate temporal information into the proposed system and successfully boost the accuracy by an extra 0.17%.
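The knowledge-distillation step that keeps the lightweight front half accurate trains the small network to match the large network's softened output distribution. Below is a minimal NumPy sketch of a standard temperature-softened distillation loss in the style of Hinton et al.; the temperature and the per-pixel logits are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened class distributions
    (here: hand vs. background per pixel), scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean() * T * T)

# Hypothetical logits for two pixels, classes = (hand, background).
teacher = np.array([[4.0, -2.0], [1.0, 3.0]])
student = np.array([[2.5, -1.0], [0.5, 2.0]])
loss = distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

The loss is zero only when the student reproduces the teacher's softened distribution exactly, so minimizing it pushes the low-complexity front half toward the behavior of the original, larger network.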

