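In this thesis, we propose a new algorithm that improves the noise robustness of speech recognition. Starting from the premise that the modulation spectrum of a speech feature sequence is concentrated at low and middle frequencies, the algorithm raises the magnitude modulation spectrum of the noise-corrupted feature sequence to a power, yielding feature time sequences that are more robust or more discriminative. Compared with other methods that update the magnitude modulation spectrum, the main advantage of the new method is that it needs no clean modulation spectrum as a reference, yet produces speech features of comparable robustness. In experiments on the widely used AURORA-2 speech database, different settings of the dynamic power-law exponent allow the original temporal feature normalization methods to reach higher recognition rates. In addition, applying the new method to features pre-processed by various statistics normalization techniques gives better recognition performance than the corresponding normalization technique alone, which confirms that the new method combines well with other robustness techniques.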
In this thesis, we present a novel approach that enhances speech features in the modulation spectrum domain to improve recognition performance in noise-corrupted environments. In the presented approach, termed modulation spectrum power-law expansion (MSPLE), the temporal stream of speech features is first pre-processed by a statistics compensation technique, such as cepstral mean and variance normalization (CMVN), cepstral gain normalization (CGN), or cepstral histogram normalization (CHN). The magnitude of the modulation spectrum (the Fourier transform of the feature stream) is then raised to a power. We find that MSPLE highlights the speech components and reduces the noise distortion that remains in the statistics-compensated speech features. Experimental results on the Aurora-2 digit database and task show that this process consistently achieves very promising recognition accuracy across a wide range of noise-corrupted environments. MSPLE applied to CMVN-preprocessed features yields a relative error rate reduction of about 45% over the MFCC baseline and significantly outperforms CMVN alone. Furthermore, applying MSPLE only to the lower half of the modulation spectrum gives results very close to those obtained by updating the full-band modulation spectrum, indicating that a less complicated form of MSPLE suffices to produce noise-robust speech features.
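The processing chain described in the abstract (statistics compensation, then raising the modulation spectrum magnitude to a power) can be illustrated with a minimal Python sketch. This is not the thesis implementation. The function names `cmvn` and `msple`, the parameters `exponent` and `low_half_only`, and the exponent value 1.5 are hypothetical choices made for illustration. The sketch also assumes that the modulation spectrum is taken as the discrete Fourier transform along the time axis of each feature dimension, that the phase is left unchanged, and that the exponent is positive.

```python
import numpy as np


def cmvn(features):
    """Mean and variance normalization per feature dimension.

    features: array of shape (T, D), T frames by D feature dimensions.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / np.maximum(std, 1e-8)


def msple(features, exponent=1.5, low_half_only=False):
    """Raise the modulation spectrum magnitude to a power (illustrative).

    The exponent default is an assumed value, not one from the thesis.
    If low_half_only is True, only the lower half of the modulation
    frequency bins is modified and the upper half is kept as-is.
    """
    num_frames = features.shape[0]
    # Modulation spectrum: DFT of each feature trajectory over time.
    spectrum = np.fft.rfft(features, axis=0)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    num_bins = magnitude.shape[0]
    upper = num_bins // 2 if low_half_only else num_bins
    new_magnitude = magnitude.copy()
    new_magnitude[:upper] = magnitude[:upper] ** exponent

    # Recombine the updated magnitude with the original phase and
    # return to the time domain.
    new_spectrum = new_magnitude * np.exp(1j * phase)
    return np.fft.irfft(new_spectrum, n=num_frames, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for a (frames x coefficients) MFCC matrix.
    mfcc = rng.standard_normal((100, 13))
    enhanced = msple(cmvn(mfcc), exponent=1.5, low_half_only=True)
    print(enhanced.shape)
```

With `exponent=1.0` the function returns its input unchanged, which gives a simple sanity check. The `low_half_only` flag mirrors the sub-band variant mentioned in the abstract, under the assumption that "low-half" refers to the lower half of the modulation frequency bins.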