Thanks to recent advances in deep learning, speech emotion recognition has achieved increasingly impressive results. However, the complexity of emotion still poses data-collection problems: speech emotion data are difficult to accumulate rapidly, and they exhibit high variability across different contexts. Initialization followed by fine-tuning is a common remedy in deep learning, but pretraining purely on multimedia background data leaves too large a gap from the speech emotion recognition task; emotional guidance during initialization, or a more precise method during fine-tuning, is still needed. This thesis therefore proposes to exploit large amounts of readily available multimedia data, together with proxy arousal and valence labels derived from their audio and text information, to learn an initialization speech front-end network for speech emotion recognition; an initialization-oriented sampling method then assists fine-tuning to build the speech emotion recognition model on the target corpus. Results show that, with the help of the speech front-end network and the sampling method, performance consistently surpasses random initialization by a clear margin.
The rapid development of deep learning technology has benefited the progress of speech emotion recognition (SER). Nevertheless, the complexity of emotion still makes it difficult to rapidly obtain large-scale annotated data and to handle the high variability across different domains. The initialization and fine-tuning strategy is a common solution in deep learning research. However, simply pretraining on abundant media data still leaves a large discrepancy between that data and the SER problem; introducing emotional guidance helps bridge this gap. In this work, we propose to learn an initialization speech front-end network on large-scale media data collected in the wild, jointly with proxy arousal-valence labels that are multimodally derived from audio and text information, and then to build the SER prediction model by fine-tuning with the assistance of an initialization-oriented sampling method. The results show that integrating both the speech front-end network and the sampling method achieves better performance than random initialization.