  • Thesis


AST-Net: An Attribute-based Siamese Temporal Network for Real-Time Emotion Recognition

Advisor: 許秋婷


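Predicting continuous, spontaneous changes in facial emotion is an important research topic in computer vision, because understanding subtle emotional changes in real time greatly benefits many applications in human-computer interaction and medical monitoring. In this thesis, we focus on analyzing the temporal dynamics of two emotion dimensions, valence and arousal. We propose an attribute-based Siamese temporal network that comprises a discrete emotion CNN model and a Stacked-LSTM. With these two models, we can effectively combine spatial facial features with long-term dynamics to aid prediction. The discrete emotion CNN model extracts emotion-related features that are unaffected by variations in motion and individual characteristics, while the Stacked-LSTM learns the dynamic dependency of emotions along the temporal domain. In addition, to stabilize training and thereby obtain smoother and more reliable long-term predictions, we feed two temporally shifted videos into the Siamese (two-branch) network architecture simultaneously. Experimental results on AVEC2012 show that the proposed method not only predicts in real time (40.1 frames per second on average) but also achieves the best results reported to date on the AVEC2012 dataset while using visual information alone.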


Predicting continuous facial emotions is essential to many applications in human-computer interaction. In this thesis, we focus on predicting two emotion dimensions, valence and arousal, to interpret dynamic yet subtle changes in facial emotion. We propose an Attribute-based Siamese Temporal Network (AST-Net), which includes a discrete emotion CNN model and a Stacked-LSTM, to incorporate both spatial facial attributes and long-term dynamics into the prediction. The discrete emotion CNN model extracts attribute-related but pose- and identity-invariant features, and the Stacked-LSTM characterizes the dynamic dependency along the temporal domain. Furthermore, to stabilize the training procedure and to derive smoother and more reliable long-term predictions, we propose to jointly learn the model from two temporally shifted videos under the Siamese network architecture. Experimental results on the AVEC2012 dataset show that the proposed AST-Net not only runs in real time (40.1 frames per second) but also achieves state-of-the-art performance even when using the vision modality alone.
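
To make the idea of learning from two temporally shifted videos more concrete, the following is a minimal sketch, not the thesis implementation. The abstract does not state how the shift is chosen or which loss couples the two Siamese branches, so the function names `shifted_clip_pair` and `overlap_consistency`, the parameters `clip_len` and `shift`, and the mean-squared agreement term are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def shifted_clip_pair(
    frames: Sequence, clip_len: int, shift: int, start: int = 0
) -> Tuple[list, list]:
    """Cut two clips of `clip_len` frames whose start indices differ by `shift`.

    Hypothetical helper: the abstract only says the two inputs are
    temporally shifted, not how the shift is selected.
    """
    if clip_len <= 0 or shift < 0 or start < 0:
        raise ValueError("clip_len must be positive; shift and start non-negative")
    end = start + shift + clip_len
    if end > len(frames):
        raise ValueError("video is too short for the requested clips")
    clip_a = list(frames[start:start + clip_len])
    clip_b = list(frames[start + shift:end])
    return clip_a, clip_b


def overlap_consistency(
    pred_a: List[float], pred_b: List[float], shift: int
) -> float:
    """Mean squared disagreement between two branches on their shared frames.

    Frame i + shift of clip A is the same video frame as frame i of clip B,
    so a smooth predictor should give both the same valence/arousal value.
    This agreement term is an assumed stand-in, not the loss from the thesis.
    """
    overlap = min(len(pred_a) - shift, len(pred_b))
    if overlap <= 0:
        return 0.0  # the clips share no frames, so there is nothing to compare
    total = sum((pred_a[i + shift] - pred_b[i]) ** 2 for i in range(overlap))
    return total / overlap


# Usage: frame indices stand in for images, and the predictions are made up.
video = list(range(10))
clip_a, clip_b = shifted_clip_pair(video, clip_len=6, shift=2)
penalty = overlap_consistency([0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0], shift=2)
```

In a full model, each clip would pass through the same CNN and Stacked-LSTM weights, and a term of this kind could be added to the per-frame regression loss to penalize inconsistent predictions on the frames the two clips share.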


