A Study of Audio-Visual Feature Extraction for Mandarin Digit Speech Recognition

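In recent years, many methods that use visual speech features to assist speech recognition have been developed. They perform well and overcome the weakness of speech recognition under background noise. This thesis proposes a Mandarin audio-visual speech recognition system that improves the recognition rate in noisy environments and under emotional states. We use geometric and motion features of the lips as the input to the back-end recognizer. These visual features are very important to the recognition system, especially in noisy environments or when the speech is emotional. To extract them, an automatic facial feature extractor first locates the lip feature points, and the geometric and motion features are then computed from these points. Next, we build a recognizer suited to Mandarin audio-visual speech that uses the weighted-discrete KNN (WD-KNN) as its classifier. We use several weighting schemes to compare and validate the proposed WD-KNN classifier: a linear distance function, an inverse distance function, rank weighting, and a Fibonacci-sequence function. The experimental results show that, in noisy environments, the WD-KNN classifier with the Fibonacci-sequence function achieves better recognition results than the other classifiers and weighting functions. Finally, we extend the study to emotional speech. For emotional audio-visual speech, the experimental results show that using visual features together with audio features as the basis for classification yields a better recognition rate.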

Keywords

Weighted-discrete KNN (WD-KNN) classifier; k-nearest neighbor (KNN); feature extraction; Mandarin audio-visual speech recognition system

Parallel Abstract

In recent years, many machine speechreading systems that combine audio and visual speech features have been proposed. The objective of these audio-visual speech recognizers is to improve recognition accuracy, particularly in difficult conditions. This thesis presents a Mandarin audio-visual recognition system that achieves a better recognition rate in noisy conditions and for speech spoken with emotion. We first extract the visual features of the lips, including geometric and motion features. These features are very important to the recognition system, especially in noisy conditions or under emotional effects. The motion features are obtained by applying an automatic face feature extractor followed by a fast motion feature extractor. We compare the system's performance when using motion and geometric features. For this recognition system, we propose the weighted-discrete KNN (WD-KNN) as the classifier, compare it with two popular classifiers, the GMM and the HMM, and evaluate their performance on a Mandarin audio-visual speech corpus. We find that the WD-KNN is a suitable classifier for Mandarin speech because of the monosyllabic nature of Mandarin and because it is computationally inexpensive. Experimental results for the different classifiers at various SNR levels are presented. They show that the WD-KNN classifier yields better recognition accuracy than the other classifiers on the Mandarin speech corpus used. Several weighting functions were also studied for the weighted KNN based classifier: linear distance weighting, inverse distance weighting, rank weighting, and reverse Fibonacci weighting. Overall, the WD-KNN classifier with the reverse Fibonacci weighting function achieves the highest recognition rate among the three extended versions of KNN considered. Finally, we perform emotional speech recognition experiments. The results show that the audio-visual speech recognition system is more robust and achieves a higher recognition rate when visual cues are incorporated.
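The abstract names the WD-KNN classifier and its weighting functions but does not define them. The following is a minimal Python sketch under stated assumptions, not the thesis's implementation. It assumes one common formulation in which, for each class, the k smallest distances from the query to that class's training samples are combined in a weighted sum and the class with the smallest sum is chosen. The function names, the Euclidean distance, the exact form of the rank weights, and the toy data are all hypothetical. The linear and inverse distance weightings are left out because their definitions cannot be recovered from the abstract.

```python
import math


def reverse_fibonacci_weights(k):
    """Assumed form: Fibonacci numbers in decreasing order, e.g. k=5 -> [5, 3, 2, 1, 1]."""
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    return fib[:k][::-1]


def rank_weights(k):
    """Assumed form: the nearest neighbor gets weight k, the farthest gets weight 1."""
    return [k - i for i in range(k)]


def wd_knn_predict(train, query, k, weights):
    """Classify `query` with a per-class weighted sum of its k smallest distances.

    `train` maps each class label to a list of feature vectors.
    This per-class formulation is an assumption, not taken from the abstract.
    """
    if len(weights) != k:
        raise ValueError("need exactly k weights")
    scores = {}
    for label, samples in train.items():
        if len(samples) < k:
            raise ValueError("every class needs at least k samples")
        # k smallest Euclidean distances from the query to this class
        dists = sorted(math.dist(query, s) for s in samples)[:k]
        scores[label] = sum(w * d for w, d in zip(weights, dists))
    # The class whose nearest samples are closest overall wins
    return min(scores, key=scores.get)


# Toy two-dimensional data standing in for audio-visual feature vectors
train = {
    "zero": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2)],
    "one": [(1.0, 1.0), (0.9, 1.1), (1.2, 1.0)],
}
predicted = wd_knn_predict(train, (0.2, 0.1), 3, reverse_fibonacci_weights(3))
```

Because the weights decrease with rank, the closest samples of each class dominate its score. Swapping `reverse_fibonacci_weights` for `rank_weights` changes only how quickly that influence falls off.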

Parallel Keywords

Audio-visual recognition; K-nearest neighbor; weighted-discrete KNN (WD-KNN); feature extraction

