High-Performance Head Pose Estimation and Deep Learning Network Design

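In computer vision, detection tasks involving the human body have always held a pivotal place, and head pose estimation models, which supply important information about the human face, are especially important. Consider an advertiser running on-screen advertisements: being able to determine with high accuracy the viewing angle and attention of a consumer's face relative to the screen would greatly help in evaluating advertising effectiveness. According to the literature we reviewed, head pose estimation in today's mature open-source projects is still at a stage where it is usable but not convenient to use. Besides the difficulty of reaching baseline accuracy, model training takes a long time and dataset labels are complex; some methods even require dozens of feature points per face image, which places a heavy computational load on the GPU. This study trains on an open-source 3D face dataset and designs a truly general-purpose model that uses no feature points and requires only head pose angle data. To strengthen the network's ability to capture facial contours and facial features, we designed an "attention mechanism" (Vision Attention Mechanism) within the deep learning network: a weight tensor that automatically learns which regions of an image are important. Our heat-map visualizations then show the key features this tensor has learned. The feature extraction network in this study, the "Double-layer Streaming Network" (雙層串流網路), contains only 7 convolution layers; experiments show that it achieves higher performance than studies that apply conventional pre-trained models and parameters. We also incorporated methods from the literature and reworked the classification algorithm into "Multi-Layer Classification" (多折分類法, literally "multi-fold classification"), bringing the algorithm closer to human intuition and intelligence.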
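The attention mechanism above is described only as a learned weight tensor that emphasizes important image regions and can be viewed as a heat map. The following is a minimal sketch of that general idea, not the authors' implementation: the function names, the sigmoid gating, and the toy 4x4 feature map are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Squash a real-valued logit into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_attention(features, logits):
    """Reweight a feature map with a per-location attention map.

    features: H x W x C nested lists (e.g. the output of a convolution layer).
    logits:   H x W nested lists, standing in for the learned weight tensor
              (hypothetical; in a real network these would be trained).
    Returns (attended_features, attention_map).
    """
    # The attention map doubles as the heat map one would visualize.
    attention = [[sigmoid(v) for v in row] for row in logits]
    attended = [
        [[value * attention[i][j] for value in pixel]
         for j, pixel in enumerate(row)]
        for i, row in enumerate(features)
    ]
    return attended, attention


# Toy input: a 4x4 feature map with 2 channels, all ones.
height, width, channels = 4, 4, 2
features = [[[1.0] * channels for _ in range(width)] for _ in range(height)]

# Pretend training has pushed up the logits of the central 2x2 region.
logits = [[0.0] * width for _ in range(height)]
for i in (1, 2):
    for j in (1, 2):
        logits[i][j] = 4.0

attended, attention = apply_attention(features, logits)
```

In this toy case the central locations keep almost all of their activation while the border locations are halved, which is the behavior a heat-map visualization of the attention map would make visible.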

Keywords

Computer vision; neural network; deep learning; head pose estimation; attention mechanism

Parallel Abstract

In the field of computer vision, detection tasks involving the human body have always been significant, and head pose estimation models, which provide important information about the human face, are especially so. For advertisers running on-screen advertisements, a model that estimates consumers' viewing direction with high accuracy would be very helpful in evaluating advertising effectiveness. In open-source projects, head pose estimation is still at a stage where it is barely usable rather than easy to use. Besides accuracy that struggles to reach the baseline, model training takes a long time and dataset labels are overly complex. Some methods even require dozens of feature points per face image, which demands heavy GPU computation. In our study, an open-source 3D face dataset is used for training. To design a truly general model, only head pose angle data are required. To strengthen the extraction of facial features, we implemented a "Vision Attention Mechanism" in the deep learning network, which automatically learns weights for the important pixels in an image. In addition, the feature extractor, called the "Double-layer Streaming Network", has only seven convolution layers. Experimental results show that it achieves higher efficiency than applying a pre-trained model and weights. Drawing on prior work, we also reworked the classification algorithm into "Multi-Layer Classification", to bring the algorithm closer to human intelligence.
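The abstract does not define "Multi-Layer Classification", so the sketch below should not be read as the authors' algorithm. It illustrates one well-known way of turning angle regression into classification, popularized by DEX (Rothe et al., 2015): classify into discrete bins, then take the softmax-weighted expectation of the bin centers as the continuous estimate. The bin width of 3 degrees, the range of -90 to 90 degrees, and all function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_angle(logits, bin_centers):
    """Continuous angle estimate: probability-weighted mean of bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, bin_centers))


# Hypothetical binning: 61 bins whose centers run from -90 to 90 degrees.
bin_centers = [-90.0 + 3.0 * k for k in range(61)]

# Uniform logits: no bin is preferred, so the estimate sits near 0 degrees.
uniform = expected_angle([0.0] * 61, bin_centers)

# One dominant bin (index 40, center 30 degrees): the estimate sits near 30.
peaked_logits = [0.0] * 61
peaked_logits[40] = 50.0
peaked = expected_angle(peaked_logits, bin_centers)
```

The same function would be applied separately to yaw, pitch, and roll if each angle had its own classification head; taking an expectation rather than the arg-max bin lets the estimate fall between bin centers.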

Parallel Keywords

Computer Vision; Neural Network; Deep Learning; Head Pose Estimation; Vision Attention Mechanism

