
中文語音情緒辨識及效能評估之研究

A Study of Emotion Recognition on Mandarin Speech and Its Performance Evaluation

Advisor: 包蒼龍

Abstract


As the saying goes, "technology always comes from humanity": the ultimate development of technology must rest on a full understanding of human nature. What is human nature? At its root it is emotion. Emotion is the basis of human expression and underlies everything we process, describe, think about, or react to. Giving computers the ability to perceive and respond to human emotion makes human-computer interaction more natural.

In past studies, speech emotion recognition systems have tried a variety of classifiers, but these classifiers were tested under different languages, corpus sizes, numbers of emotional states, and recording methods, which makes it difficult to compare their results and evaluate their performance. In this thesis, we propose a weighted discrete K-nearest neighbor (WD-KNN) classification algorithm and compare it with several algorithms adopted in previous work, evaluating their performance on the Mandarin emotional speech corpus we constructed.

The experiments proceed in several parts. First, we use the K-nearest neighbor (KNN) classifier as the baseline system to determine a suitable parameter k and the best feature set. Experiments with different values of k show that KNN achieves its best recognition rate, 70.7%, when k equals 10; to keep subsequent comparisons fair, k is fixed at 10 in all KNN-based classifiers used in the later experiments. The best feature set contains three feature types: linear predictive coefficients (LPC), linear predictive cepstral coefficients (LPCC), and Mel-frequency cepstral coefficients (MFCC). Compared with the results before feature selection (that is, using all features), the recognition rate improves by 2.1% while the number of feature types drops from 13 to 3.

Next, we compare three KNN-based classification algorithms, weighted KNN (WKNN), KNN using categorical average patterns (WCAP), and the weighted discrete KNN (WD-KNN), under different weight sequences. Relative to the baseline system, these three algorithms improve the recognition rate by up to 4.9%, 2.8%, and 12.3%, respectively; WD-KNN with a Fibonacci weight sequence reaches the highest recognition rate, 81.4%.

We then evaluate classifiers including KNN, modified KNN (MKNN), weighted KNN (WKNN), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), Gaussian mixture models (GMM), hidden Markov models (HMM), support vector machines (SVM), back-propagation neural networks (BPNN), and the proposed WD-KNN on our corpus for Mandarin speech emotion recognition. Both the experimental results and McNemar's test show that the proposed WD-KNN outperforms the other classifiers and achieves the best recognition rate. As further verification, we run the same classifiers on another Mandarin emotional speech corpus of 2,000 utterances; the results again show that the proposed WD-KNN outperforms the others.

Finally, we implement an emotion evaluation method with WD-KNN at its core. In the emotion evaluation system we design an emotion radar chart that presents the intensity of each emotion in an utterance. We hope this system can be applied to speech training, especially to help hearing-impaired people learn to express emotion in speech as naturally as other speakers do.
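
The abstract names WD-KNN but does not spell out the procedure. The sketch below is a minimal illustration of one plausible reading of a "weighted discrete" KNN, not the thesis implementation: each emotion class contributes its own k nearest training samples, their distances are combined with a descending weight sequence such as the Fibonacci numbers, and the class with the smallest weighted distance sum wins. The function names, the Euclidean metric, the direction of the weight sequence (largest weight on the nearest neighbor), and the synthetic two-dimensional "features" standing in for the LPC/LPCC/MFCC vectors are all assumptions made for illustration.

```python
import numpy as np

def fibonacci_weights(k):
    """First k Fibonacci numbers, reversed so the nearest neighbor
    receives the largest weight (an assumed convention)."""
    w = [1.0, 1.0]
    while len(w) < k:
        w.append(w[-1] + w[-2])
    return np.array(w[:k])[::-1]

def wd_knn_predict(x, train_X, train_y, k=10, weights=None):
    """Assign x to the class whose own k nearest training samples
    yield the smallest weighted sum of distances."""
    if weights is None:
        weights = fibonacci_weights(k)
    best_label, best_score = None, np.inf
    for label in np.unique(train_y):
        class_X = train_X[train_y == label]
        # k smallest Euclidean distances within this class only
        d = np.sort(np.linalg.norm(class_X - x, axis=1))[:k]
        score = float(np.dot(weights[:len(d)], d))
        if score < best_score:
            best_label, best_score = label, score
    return best_label

# Toy usage with synthetic features for two emotion classes
rng = np.random.default_rng(0)
train_X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
train_y = np.array(["angry"] * 30 + ["sad"] * 30)
print(wd_knn_predict(np.array([2.8, 3.1]), train_X, train_y, k=10))  # -> "sad"
```

A descending weight sequence makes the decision dominated by the closest few samples of each class while still smoothing over single outliers, which is consistent with the abstract's finding that the Fibonacci sequence gave the best recognition rate.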

English Abstract


It is said that technology comes from humanity. What is humanity? The very definition of humanity is emotion. Emotion is the basis for all human expression and the underlying theme behind everything that is done, said, thought, or imagined. Making computers able to perceive and respond to human emotion will make human-computer interaction more natural.

In the past, several classifiers were adopted independently and tested on emotional speech corpora that differ in language, size, number of emotional states, and recording method. This makes it difficult to compare and evaluate the performance of those classifiers. In this thesis, we propose a weighted discrete K-nearest neighbor (WD-KNN) classification algorithm and compare it with several classification methods by applying them all to the same Mandarin emotional speech corpus.

We first implemented a baseline system to determine the parameter k in KNN-based classifiers and to select the best feature set. Experiments with different values of k show that the KNN classifier achieves its best performance, 70.7%, when k is set to 10. To keep the comparisons fair, k is set to 10 in all KNN-based classifiers throughout this thesis. The best feature set includes LPC, LPCC, and MFCC; compared with the performance before feature selection, accuracy improves by 2.1% while the number of feature types is reduced from 13 to 3.

Next, we compared different weighting schemes on KNN-based classifiers, including the traditional K-nearest neighbor (KNN) classifier, weighted KNN (WKNN), KNN classification using categorical average patterns (WCAP), and WD-KNN. Relative to the baseline, WKNN, WCAP, and WD-KNN achieve accuracy improvements of up to 4.9%, 2.8%, and 12.3%, respectively. The highest recognition rate, 81.4%, is obtained by WD-KNN weighted with the Fibonacci sequence.

We then evaluated the performance of several classifiers, including KNN, MKNN, WKNN, LDA, QDA, GMM, HMM, SVM, BPNN, and the proposed WD-KNN, for detecting emotion in Mandarin speech. The experimental results and McNemar's test both show that the proposed WD-KNN classifier achieves the best accuracy on the 5-class emotion recognition task and outperforms the other classification techniques. To verify this advantage, we applied the same classifiers to another Mandarin expressive speech corpus covering two emotions and consisting of 2,000 utterances; the results again show that the proposed WD-KNN outperforms the others.

Finally, we implemented in our emotion recognition system an emotion radar chart, based on WD-KNN, that presents the intensity of each emotion component in an utterance. Such a system can be further used in speech training, especially to help the hearing-impaired learn to express emotions in speech more naturally.
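
The abstracts above report McNemar's test as the evidence that WD-KNN's advantage is statistically significant. McNemar's test examines only the test utterances on which two classifiers disagree: if classifier A is right where B is wrong about as often as the reverse, the accuracy difference is attributable to chance. Below is a minimal sketch of the common chi-square version with continuity correction; the function name and the hypothetical counts in the usage example are mine, and the thesis's actual contingency counts are not reproduced here.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar's test: do classifiers A and B
    have the same error rate on the same test set?"""
    a_ok = np.asarray(pred_a) == np.asarray(y_true)
    b_ok = np.asarray(pred_b) == np.asarray(y_true)
    n01 = int(np.sum(a_ok & ~b_ok))   # A correct, B wrong
    n10 = int(np.sum(~a_ok & b_ok))   # A wrong, B correct
    if n01 + n10 == 0:
        return 0.0, 1.0               # the classifiers never disagree
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return stat, chi2.sf(stat, df=1)  # p-value, chi-square with 1 dof

# Hypothetical example: A and B disagree on 24 utterances, all in A's favor
y_true = [0] * 200
pred_a = [0] * 194 + [1] * 6          # A wrong on the last 6
pred_b = [0] * 170 + [1] * 30         # B wrong on the last 30
stat, p = mcnemar_test(y_true, pred_a, pred_b)
print(f"chi2 = {stat:.2f}, p = {p:.2g}")  # tiny p -> accuracies differ
```

For small disagreement counts an exact binomial version of the test is preferable; the chi-square form above is the usual large-sample approximation.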
