

Enhancing Speech Features in Various Domains for Noise-Robust Speech Recognition

Advisor: 洪志偉


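The presence of noise and interference creates a mismatch between the development environment and the application environment of an automatic speech recognition system, which in turn degrades recognition accuracy. Existing techniques for this problem fall roughly into three categories: speech enhancement, robust speech feature representation, and speech model adaptation. The new methods developed and discussed in this dissertation belong mainly to the category of robust speech feature representation. In speech recognition, mel-frequency cepstral coefficients (MFCC) are among the most widely used speech features. In this dissertation, we examine the characteristics of noise interference in each domain of the MFCC extraction process and develop a corresponding robustness algorithm for each domain, as follows:
• Linear spectral domain: Magnitude Spectrum Enhancement
• Mel-spectral domain: Hybrid Cepstral Statistics Normalization
• Cepstral domain: Modulation Spectrum Replacement (MSR) and Modulation Spectrum Filtering (MSF)
We evaluate the proposed methods with recognition experiments on the Aurora-2 connected-digit corpus. The results show that all of these methods effectively improve the recognition accuracy of the original MFCC in noisy environments. Compared with many existing robustness techniques, most of the new methods achieve comparable or even better performance, which demonstrates their value for application and further development.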


The performance of an automatic speech recognition (ASR) system is often degraded by the various types of noise and interference in the application environment. In this dissertation, we aim to develop robustness methods specifically for handling additive noise and channel disturbance. In particular, these methods are used to refine the mel-frequency cepstral coefficients (MFCC), one of the most widely used speech feature representations in ASR. First, we discuss the effect of noise in the linear spectral domain of MFCC and present magnitude spectrum enhancement (MSE), an approach that refines the spectrum of speech signals. Next, we present hybrid cepstral statistics normalization, which processes the MFCC in the mel-spectral domain. Finally, we provide two novel compensation algorithms, modulation spectrum replacement (MSR) and modulation spectrum filtering (MSF), to enhance the MFCC in the cepstral domain. Recognition experiments conducted on the Aurora-2 connected-digit database show that these novel methods improve the recognition accuracy of the MFCC in various noise conditions, and that in most cases they perform better than, or at least similarly to, state-of-the-art noise robustness techniques such as Wiener filtering (WF), spectral subtraction (SS), mean and variance normalization (MVN), and histogram equalization (HEQ).
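For readers unfamiliar with the baselines named above, the following is a minimal sketch of mean and variance normalization (MVN), one of the comparison techniques. It is not one of the methods proposed in this dissertation. The function name `mvn`, the per-utterance normalization scope, and the toy feature values are assumptions made for illustration only.

```python
import math


def mvn(features, eps=1e-12):
    """Normalize each feature dimension to zero mean and unit variance.

    features: list of frames; each frame is a list of floats of equal length.
    Statistics are computed over the whole utterance (an assumption here).
    """
    if not features:
        return []
    n_frames = len(features)
    n_dims = len(features[0])
    # Per-dimension mean over all frames.
    means = [sum(f[d] for f in features) / n_frames for d in range(n_dims)]
    # Per-dimension (population) standard deviation over all frames.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in features) / n_frames)
        for d in range(n_dims)
    ]
    # Subtract the mean and divide by the standard deviation; eps guards
    # against division by zero for a constant dimension.
    return [
        [(f[d] - means[d]) / max(stds[d], eps) for d in range(n_dims)]
        for f in features
    ]


# Toy "utterance": 4 frames of 2-dimensional features (made-up values).
frames = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
normalized = mvn(frames)
```

After normalization, both dimensions of the toy utterance have the same statistics even though their original scales differ by a factor of ten, which is the sense in which MVN reduces the mismatch between clean and distorted feature distributions.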




