  • Thesis

Robust Speech Recognition Techniques Based on Modulation Spectrum Equalization (基於調變頻譜等化法之強健性語音辨識技術)

Modulation Spectrum Equalization for Improved Robust Speech Recognition

Advisor: Lin-shan Lee (李琳山)

Abstract


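In robust speech recognition, temporal filtering is a common and quite effective technique. Well-established methods include the relative spectra (RASTA) filter and data-driven temporal filters designed with principal component analysis (PCA) and linear discriminant analysis (LDA). These methods design filters on the time trajectories of the speech feature parameters, or on their modulation spectrum, so that noise in the speech signal is effectively suppressed. Their drawback is that they cannot adjust to different noise environments, so they rarely perform well under all noise conditions.

The modulation spectrum equalization proposed in this thesis can be viewed as an adaptive temporal filter: utterances recorded in different noise environments yield different filter frequency responses, which effectively improves recognition results across a variety of noise environments. In these techniques, we first transform the time trajectories of the speech feature parameters into the modulation spectrum with the Fourier transform, and every proposed technique is designed directly from the distribution of the signal over the modulation spectrum. In spectral histogram equalization (SHE), we first use clean training data to estimate the probability distribution of the modulation spectrum as a reference distribution, and then equalize the modulation spectrum distribution of each test utterance to this reference. In two-band spectral histogram equalization (2B-SHE), we exploit the fact that low and high modulation frequencies usually carry different speech information, and equalize the low-frequency and high-frequency parts of the test utterance to separate reference distributions, which gives better recognition results than SHE. In magnitude ratio equalization (MRE), we equalize the magnitude ratio of the test utterance on the modulation spectrum to a reference magnitude ratio computed from clean data.

Experiments on the English connected-digit corpus (Aurora 2) and the English large-vocabulary continuous speech corpus (Aurora 4) show that the proposed techniques clearly improve recognition accuracy over conventional temporal filtering techniques, and that they can be combined effectively with several well-known cepstral normalization methods to raise accuracy further. Beyond recognition accuracy, we also examine the reasons for the improvement from several angles, including the filter responses derived by these techniques, the behavior of noise on the modulation spectrum, the recognition accuracy of different phonemes, and modulation spectrum distance.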
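The spectral histogram equalization step described above can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names, the rank-based mapping, the use of a single pooled array of sorted clean magnitudes as the "reference histogram", and the decision to equalize every bin (including DC) are all assumptions made for illustration.

```python
import numpy as np

def equalize_to_reference(mag, ref_sorted):
    """Map each magnitude to the reference quantile that matches its rank.

    ref_sorted is a hypothetical stand-in for the reference distribution:
    a sorted 1-D array of modulation spectrum magnitudes from clean data.
    """
    n = len(mag)
    ranks = np.argsort(np.argsort(mag))              # rank of each bin, 0..n-1
    cdf = (ranks + 0.5) / n                          # empirical CDF value per bin
    ref_cdf = (np.arange(len(ref_sorted)) + 0.5) / len(ref_sorted)
    return np.interp(cdf, ref_cdf, ref_sorted)       # inverse reference CDF

def she(traj, ref_sorted):
    """Equalize the modulation spectrum magnitude of one feature trajectory.

    traj is the time trajectory of a single feature parameter over one
    utterance. The phase is kept unchanged (an assumption of this sketch).
    """
    spec = np.fft.rfft(traj)                         # modulation spectrum
    new_mag = equalize_to_reference(np.abs(spec), ref_sorted)
    new_spec = new_mag * np.exp(1j * np.angle(spec))
    return np.fft.irfft(new_spec, n=len(traj))       # back to the time domain
```

A two-band variant in the spirit of 2B-SHE would call `equalize_to_reference` separately on the low and high bins with two reference arrays; where to place the band boundary is not stated in the abstract.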

Parallel Abstract


We propose novel approaches for equalizing the modulation spectrum for robust feature extraction in speech recognition. In these approaches the temporal trajectories of the feature parameters are first transformed into the magnitude modulation spectrum. In spectral histogram equalization (SHE) and two-band spectral histogram equalization (2B-SHE), we simply equalize the histogram of the modulation spectrum for each utterance to a reference histogram obtained from clean training data, or perform this equalization separately on two sub-bands of the modulation spectrum. In magnitude ratio equalization (MRE), we define the magnitude ratio of lower to higher modulation frequency components for each utterance, and equalize this to a reference value obtained from clean training data. These approaches can be viewed as temporal filters that are adapted to each testing utterance. Experiments performed on the Aurora 2 and 4 corpora for small and large vocabulary tasks indicate that significant performance improvements are achievable for all noise conditions (additive or convolutional, different noise types, and different SNR values). We also show that additional improvements are obtainable when these approaches are integrated with cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), or higher-order cepstral moment normalization (HOCMN). We analyze and discuss reasons why such improvements are achievable from different viewpoints with different sets of data, including adaptive temporal filtering, noise behavior on the modulation spectrum, phoneme types, and modulation spectrum distance.
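As a rough illustration of the magnitude ratio idea, the sketch below computes a low-to-high magnitude ratio for one feature trajectory and rescales the spectrum so the ratio matches a reference value. The abstract does not give the exact definition, so several choices here are assumptions: the ratio is taken as a ratio of summed magnitudes, the bands are split at a hypothetical bin index `cutoff` (with DC counted in the low band), and only the high band is rescaled.

```python
import numpy as np

def magnitude_ratio(mag, cutoff):
    """Ratio of summed low-band to summed high-band magnitudes.

    Assumes the high band is not all zeros.
    """
    return mag[:cutoff].sum() / mag[cutoff:].sum()

def mre(traj, ref_ratio, cutoff):
    """Rescale the high band so the magnitude ratio equals ref_ratio.

    ref_ratio stands in for the reference value estimated from clean data.
    """
    spec = np.fft.rfft(traj)                         # modulation spectrum
    gain = magnitude_ratio(np.abs(spec), cutoff) / ref_ratio
    spec[cutoff:] *= gain                            # real gain, phase unchanged
    return np.fft.irfft(spec, n=len(traj))           # back to the time domain
```

Because the gain depends on each utterance's own ratio, the resulting filter response differs from utterance to utterance, which is the sense in which the abstract calls these approaches adaptive temporal filters.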

