
Improving Speech Emotion Recognition Systems by Considering Individual and Context Differences

Advisor: 李祈均 (Chi-Chun Lee)

Abstract


Human factors play a crucial role in the process of emotion perception. Psychological research shows that people with different personality traits, or in different situations, express and perceive emotions differently. Consequently, if we measure and quantify these behaviors in a single uniform way, a model with poor recognition performance is to be expected. This is especially true for data-driven models such as neural networks, where such differences have an even deeper impact. On the perceiver side, building sub-models for individual annotators to improve recognition performance is already a mature research topic; for cross-context (cross-corpus) and individualized emotion recognition models, however, the gains obtained so far have been limited. In this study we therefore propose two methods: a maximum difference regression model (MRD) and a multi-speaker mixture-of-experts model (MoE), targeting cross-corpus speech emotion recognition and individualized speech emotion recognition, respectively. Compared with previously proposed methods, both achieve significant improvements on the USC-IEMOCAP and MSP-IMPROV databases. For the mixture-of-experts model, we also compare the output gating weights against the accuracies of the pre-trained sub-models, and find that the proposed model does assign weights to the individual sub-models according to the characteristics of different speakers when making the final prediction. Summarizing Experiments 1 and 2, we find that, beyond the emotion perceiver, accounting for contextual and speaker factors makes the model more human-aware and thereby more effective.
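The mixture-of-experts fusion described above can be sketched in a few lines of numpy. This is a minimal illustration only: the linear gating network, the feature dimensions, and the expert interface are hypothetical, since the abstract does not specify the actual architecture. The gate produces one weight per pre-trained speaker sub-model, and the final emotion prediction is the weighted sum of the experts' class probabilities.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_predict(features, experts, gate_w, gate_b):
    """Fuse expert emotion predictions with speaker-dependent gate weights.

    features: 1-D acoustic feature vector for one utterance.
    experts:  list of callables, each standing in for a pre-trained
              sub-model that returns emotion-class probabilities.
    gate_w, gate_b: parameters of a (hypothetical) linear gating network.
    """
    weights = softmax(gate_w @ features + gate_b)      # one weight per expert
    preds = np.stack([e(features) for e in experts])   # (n_experts, n_classes)
    return weights @ preds, weights                    # weighted fusion

# Toy usage: three dummy "speaker experts" over four emotion classes.
rng = np.random.default_rng(0)
experts = [lambda f, i=i: softmax(rng.standard_normal(4) + i) for i in range(3)]
feat = rng.standard_normal(8)
gate_w, gate_b = rng.standard_normal((3, 8)), np.zeros(3)
probs, weights = moe_predict(feat, experts, gate_w, gate_b)
```

Because the gate weights and each expert's output are both normalized, the fused output is itself a valid probability distribution over the emotion classes, and inspecting `weights` shows which sub-model the gate trusts for a given speaker sample.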


