
Research on the Prediction of Facial Landmarks and Pose Based on Convolutional Neural Networks

Advisor: 魏世杰
Co-advisor: 徐煥智

Abstract


Convolutional neural networks (CNNs) are now widely applied in computer vision, including the prediction of facial landmarks and head pose. To obtain more accurate predictions, networks have grown ever deeper; ResNet, for example, reaches up to 152 layers, with a corresponding increase in complexity. On the other hand, high-accuracy models are often too large in capacity, easily exhausting memory and even running inefficiently. As mobile devices become increasingly popular, the requirements on prediction models have gradually shifted from accuracy alone toward also being lightweight and efficient. Facial landmarks consist of nearly one hundred key-point coordinates on a face image, covering the contours of the face, eyes, mouth, and nose. Head pose is expressed in terms of the head's Euler angles: yaw (shaking), pitch (nodding), and roll (tilting). In past literature, facial landmarks and head pose have come from separate models, and few models can predict both at the same time. This study uses the small MobileNet-V2 network as the front-end architecture to extract image features, which are fed to a fully connected network, trained on the WFLW public facial-landmark data set, to predict landmark coordinates; meanwhile, the same extracted features are fed to another set of convolutional and fully connected layers, trained on the 300-W public data set that includes Euler angles, to predict head pose. After combining and training the above architecture, we name the result the YGNet model, which achieves the goal of predicting facial landmarks and pose simultaneously with a single lightweight neural network. Implementation results show that the model occupies about 60 MB and can run at a refresh rate of 60 frames per second.

Parallel Abstract


Convolutional neural networks (CNNs) have been widely used in the field of computer vision, and for 2D images it has become popular to use CNNs for the prediction of facial landmarks and head pose. To obtain more accurate results, CNN models have become deeper and deeper; ResNet-152, for example, uses 152 layers of convolution. As accuracy increases, the models often grow too large in capacity, easily causing insufficient memory or even inefficient performance. With the increasing popularity of mobile devices, the requirements for prediction models have gradually shifted from precision alone to also accommodating lightweight design and high performance. The facial landmarks are composed of 98 key points of the face, including those along the contours of the face, eyes, mouth, and nose. Based on Euler angles, the head pose comprises the three head angles of shaking (yaw), nodding (pitch), and tilting (roll). In past literature, facial landmarks and head pose have mostly come from separate models; few models can predict both at the same time. In this study, a small MobileNet-V2 is used as the main architecture for feature extraction. The extracted features are fed to a fully connected layer to predict the coordinates of the facial landmarks; the WFLW open data set of facial landmarks is used for this part of the training. Meanwhile, the extracted features are also fed to another set of convolutional and fully connected layers to predict the head pose; the 300-W open data set, which includes Euler angles, is used for this part of the training. With this framework, a lightweight neural network model called YGNet is composed which can perform the two tasks of facial landmark and head pose detection at the same time. In our implementation, the model takes up about 60 MB of capacity and can operate at a refresh rate of 60 frames per second (fps).
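The shared-backbone, two-head design described above can be sketched schematically as follows. This is a minimal illustrative skeleton, not the thesis code: untrained random projections stand in for the trained MobileNet-V2 and its heads, and the feature width (1280, MobileNet-V2's usual final width) and the toy input size are assumptions; only the 98 WFLW landmark points and the 3 Euler angles come from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 1280      # assumed: MobileNet-V2's final feature width
N_LANDMARKS = 98     # WFLW annotates 98 key points per face

def backbone(image):
    """Stand-in for MobileNet-V2 feature extraction (random projection)."""
    w = rng.standard_normal((image.size, FEAT_DIM)) * 0.01
    return image.reshape(-1) @ w

def landmark_head(feat):
    """Fully connected head -> (98, 2) landmark coordinates."""
    w = rng.standard_normal((FEAT_DIM, N_LANDMARKS * 2)) * 0.01
    return (feat @ w).reshape(N_LANDMARKS, 2)

def pose_head(feat):
    """Second head -> the 3 Euler angles (yaw, pitch, roll)."""
    w = rng.standard_normal((FEAT_DIM, 3)) * 0.01
    return feat @ w

image = rng.standard_normal((32, 32, 3))  # toy-sized face crop
feat = backbone(image)                    # one shared feature vector...
landmarks = landmark_head(feat)           # ...drives both task heads
yaw, pitch, roll = pose_head(feat)

print(landmarks.shape)  # (98, 2)
```

The key point mirrored here is that both heads consume the same backbone features, so the two tasks share most of the computation, which is what keeps the combined model lightweight.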

