A Study on Applying Bayesian Networks and Adaptive Adjustment Methods to Speech Emotion Recognition

Speech Emotion Recognition Using Bayesian Network and Adaptive Approach Methods

Advisor: 劉佩玲

Abstract


The main objective of this study is to develop an automatic speech emotion recognition method based on Bayesian networks: emotion-related feature parameters are computed from the speech signal and compared against the data for each emotion in a database to identify the speaker's emotional state. First, statistical speech emotion features such as pitch, frame energy, formants, and Mel-scale Frequency Cepstral Coefficients (MFCC) are computed from the speaker's emotional speech signal. The database mean of each feature under the neutral emotion is then taken as a normalization factor, and the pitch, frame energy, and formant features are normalized by these factors to eliminate differences between speakers.

Each feature discriminates emotions to a different degree; pitch, for example, can roughly separate sadness from neutrality, while happiness and anger fall into the same cluster. Since no single feature clearly distinguishes all four emotions, this study adopts a layered analysis: features are first grouped so that features in the same group have similar classification behavior, and the grouping is used to build a Multi-Layered Bayesian Network (MLBN). The input features of the first layer can each distinguish only two emotion clusters, those of the second layer can distinguish three, and features with no obvious clustering effect are placed in the third layer to distinguish all four emotions. Because the features are correlated with one another, this thesis further extends the MLBN to take these correlations into account, yielding the Multi-Layered Bayesian Network with Covariance (MLBNC).

Recognition is usually poor when the speaker's data are not part of the training set. To improve this, the study proposes adaptive MLBN and adaptive MLBNC methods: during adaptation, whenever the recognition result disagrees with the speaker's actual emotion, the means and standard deviations (or covariances) of the clusters in the MLBN or MLBNC database are adjusted according to the feature values obtained from the speaker's emotional speech, so that they match the speaker's true emotional state.

To validate the proposed methods, the German emotional speech database EMO-DB is used as both training and testing data for inside and outside tests of KNN, SVM, MLBN, and MLBNC. A cross-language test is also conducted with EMO-DB as training data and an emotional corpus recorded in-house by the Industrial Technology Research Institute (ITRI) as testing data for KNN, SVM, MLBN, and MLBNC. To validate adaptive MLBN and MLBNC, EMO-DB serves as training data, the ITRI corpus as adaptation and testing data, and adaptive KNN, MLBN, and MLBNC are tested before and after adaptation.

The experimental results show inside-test recognition rates of 81.1% for the proposed MLBN, 88.8% for MLBNC, and 70.8% for plain Bayesian decision, indicating that recognizing emotions by layering and grouping the features effectively raises the recognition rate, and that MLBNC, which accounts for the correlations between features, also outperforms MLBN. In the outside test, KNN, SVM, and MLBN achieve 78.2%, 89.1%, and 69.9% with the original features and 82.6%, 91.7%, and 77.6% with the normalized features, showing that normalization effectively narrows the feature differences between speakers. When the training and testing corpora are in different languages, the recognition rates of KNN, SVM, MLBN, and MLBNC fall to 34.21%, 46.92%, 39.33%, and 52.08%, respectively: every classifier performs poorly when the pronunciation or the manner of expressing emotion differs from the database.

In the adaptation experiments, adaptation raises the recognition rate of KNN from 34.2% to 73.7%, of MLBN from 37.8% to 82.4%, and of MLBNC from 51.6% to 81.2%; after the database is corrected, the proposed adaptive MLBN and adaptive MLBNC clearly outperform adaptive KNN. As the number of adaptation iterations increases, the rates of MLBN and MLBNC rise from 39.3% to 88.9% and from 52.1% to 90.0%, respectively, showing that the proposed adaptation method truly reflects the speaker's actual state and yields good post-adaptation recognition results.
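As a rough illustration of the normalization step described above, the sketch below divides each utterance-level feature by the corpus mean of that feature under the neutral emotion. The feature layout and numeric values are illustrative assumptions, not the thesis's actual pipeline.

```python
import numpy as np

def neutral_factors(neutral_features: np.ndarray) -> np.ndarray:
    """Per-feature mean over the neutral-emotion utterances in the corpus."""
    return neutral_features.mean(axis=0)

def normalize(features: np.ndarray, factors: np.ndarray) -> np.ndarray:
    """Divide pitch / frame-energy / formant features by their neutral means."""
    return features / factors

# Toy usage: 3 neutral utterances x 2 features (mean pitch in Hz, mean frame energy).
neutral_db = np.array([[120.0, 0.52],
                       [135.0, 0.48],
                       [128.0, 0.50]])
factors = neutral_factors(neutral_db)
utterance = np.array([182.0, 0.71])      # e.g. features of an angry utterance
print(normalize(utterance, factors))     # values now expressed relative to neutral
```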

Parallel Abstract


The objective of this study is to develop an automatic speech emotion recognition method using a Bayesian network. By calculating the relevant features of emotional speech and comparing them with an emotion database, the speaker's emotional state can be identified. First, we calculate statistical features of pitch, frame energy, formants, and mel-scale frequency cepstral coefficients (MFCC). We then use the mean value of the neutral emotion in the corpus as a normalization factor for each feature and compute normalized pitch, frame energy, and formant features. The normalized features reduce the feature differences between speakers.

Each feature has a different ability to discriminate emotions. For example, the normalized pitch mean can separate sad from neutral, while happy and angry fall into the same cluster. No feature can clearly separate all four emotions, so we recognize them cluster by cluster, layer by layer. We group the features that have similar discrimination ability and establish the Multi-Layered Bayesian Network (MLBN) for speech emotion recognition. The features of layer 1 can each distinguish two emotion clusters, the features of layer 2 can distinguish three, and the features with no obvious clustering are placed in layer 3 to distinguish all four emotions. Because the features are correlated with each other, we extend MLBN and establish the Multi-Layered Bayesian Network with Covariance (MLBNC), which takes these correlations into account.

The recognition rate is poor if the recognizer's training data do not contain the speaker's emotional speech. We therefore propose adaptive MLBN and MLBNC methods for speech emotion recognition. During adaptation, whenever the recognition result is wrong, we adjust the means and standard deviations (or covariances) of the clusters in the MLBN or MLBNC database to fit the speaker's real emotional state.

To verify the proposed methods, we use the German emotional database (EMO-DB) as training and testing data for inside and outside tests of the KNN, SVM, MLBN, and MLBNC recognizers. We also use EMO-DB as training data and the ITRI emotional database as testing data for a cross-corpus test. In the adaptation tests, we use EMO-DB as training data and the ITRI emotional database as adaptation and testing data for the adaptive KNN, MLBN, and MLBNC recognizers.

The inside-test recognition rates of MLBN, MLBNC, and Bayesian decision (BD) are 81.1%, 88.8%, and 70.8%, respectively. This shows that clustering the features layer by layer effectively increases the recognition rate, and that results improve further when the correlations between features are taken into account. In the outside test, the recognition rates of KNN, SVM, and MLBN are 78.2%, 89.1%, and 69.9% with the original features and 82.6%, 91.7%, and 77.6% with the normalized features, which shows that normalized features reduce the feature differences between speakers and increase the recognition rate. When the testing corpus differs from the training corpus, the recognition rates of KNN, SVM, MLBN, and MLBNC are 34.21%, 46.92%, 39.33%, and 52.08%, respectively: every recognizer performs poorly when the speaker's pronunciation or emotional expression differs from the training data.
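To make the MLBN/MLBNC distinction above concrete, the following minimal sketch shows how one layer could score an input: an MLBN-style layer treats the features in a group as independent Gaussians (a mean and standard deviation per emotion cluster), while an MLBNC-style layer uses a full covariance matrix. The cluster names, values, and layer layout are assumptions for illustration, not the thesis's exact network.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

def mlbn_scores(x, means, stds):
    """MLBN-style layer: features in the group treated as independent Gaussians."""
    return {emo: norm.pdf(x, means[emo], stds[emo]).prod() for emo in means}

def mlbnc_scores(x, means, covs):
    """MLBNC-style layer: full covariance captures correlations between features."""
    return {emo: multivariate_normal.pdf(x, means[emo], covs[emo]) for emo in means}

# Toy layer 1: normalized pitch features that separate {sad, neutral}
# from {happy, angry}, as both abstracts describe.
means = {"sad+neutral": np.array([0.90, 0.85]),
         "happy+angry": np.array([1.30, 1.40])}
stds  = {"sad+neutral": np.array([0.10, 0.10]),
         "happy+angry": np.array([0.15, 0.20])}
covs  = {emo: np.diag(stds[emo] ** 2) for emo in means}  # toy: diagonal covariance

x = np.array([1.25, 1.35])
print(mlbn_scores(x, means, stds))    # the winning cluster is refined by later layers
print(mlbnc_scores(x, means, covs))
```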
In the adaptive emotion recognition tests, adaptation raises the recognition rate of KNN from 34.2% to 73.7%, of MLBN from 37.8% to 82.4%, and of MLBNC from 51.6% to 81.2%. The proposed adaptive MLBN and MLBNC methods outperform adaptive KNN. As the number of adaptation iterations increases, the recognition rate of MLBN rises from 39.3% to 88.9% and that of MLBNC from 52.1% to 90.0%, which shows that adaptive MLBN and MLBNC can genuinely reflect the speaker's real emotional state and achieve good recognition results after appropriate adjustment.
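The adaptation step can be sketched as a simple interpolation: when the recognized emotion disagrees with the speaker's actual emotion, the stored statistics of the true emotion's cluster are pulled toward the observed features. The weight `alpha` and the exact update rule below are illustrative assumptions; the abstract only states that the cluster means and standard deviations (or covariances) are adjusted.

```python
import numpy as np

def adapt_cluster(mean, std, x, alpha=0.2):
    """Pull the cluster mean and standard deviation toward sample x.

    alpha is a hypothetical adaptation weight; repeated corrections
    (more adaptation iterations) move the database further toward
    the speaker's real emotional state.
    """
    new_mean = (1.0 - alpha) * mean + alpha * x
    new_std = (1.0 - alpha) * std + alpha * np.abs(x - new_mean)
    return new_mean, new_std

mean, std = np.array([0.90, 0.85]), np.array([0.10, 0.10])
x = np.array([1.10, 1.00])           # misrecognized utterance, true label "sad"
for _ in range(3):                   # a few adaptation iterations
    mean, std = adapt_cluster(mean, std, x)
print(mean, std)
```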

Parallel Keywords

speech emotion recognition; feature normalization; MLBN; MLBNC; adaptive
