Thanks to recent advances in deep learning, speech emotion recognition has achieved increasingly impressive results. However, the complexity of emotion still poses data-collection problems: speech emotion data are difficult to accumulate rapidly, and they exhibit high variability across different contexts. Initialization followed by fine-tuning is a common remedy in deep learning, but pretraining purely on multimedia background data leaves too large a gap from the speech emotion recognition task; emotional guidance during initialization, or a more precise method during fine-tuning, is still needed. This thesis therefore proposes to exploit large amounts of readily available multimedia data, together with proxy arousal and valence labels derived from their audio and text information, to learn an initialization speech front-end network for speech emotion recognition; an initialization-oriented sampling method then assists fine-tuning to build the speech emotion recognition model on the target corpus. Results show that, with the help of the speech front-end network and the sampling method, performance consistently surpasses random initialization by a clear margin.
The rapid development of deep learning technology has benefited the progress of speech emotion recognition (SER). Nevertheless, the complexity of emotion still makes it difficult to rapidly obtain large-scale annotated data and to handle the high variability across different domains. The initialization and fine-tuning strategy is a common solution in deep learning research. However, simply pretraining on abundant media data still leaves a large discrepancy between that data and the SER problem; introducing emotional guidance helps bridge this gap. In this work, we propose to learn an initialization speech front-end network on large-scale media data collected in the wild, jointly with proxy arousal-valence labels that are multimodally derived from audio and text information, and then to build the SER prediction model by fine-tuning with the assistance of an initialization-oriented sampling method. The results show that integrating both the speech front-end network and the sampling method achieves better performance than random initialization.