The study of modulation spectrum power-law expansion for robust speech recognition (original title: 調變頻譜冪次展開法於強健語音辨識之研究)

Advisor: 洪志偉

Abstract


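In this thesis, we propose a new algorithm that improves the noise robustness of speech recognition. The algorithm starts from the premise that the spectrum of a speech feature sequence is concentrated at low and middle modulation frequencies. It adjusts the power-law exponent applied to the magnitude modulation spectrum of the noise-corrupted speech feature sequence, which yields a feature time series that is more robust or more discriminative. Compared with other methods that update the magnitude modulation spectrum, the main advantage of the new method is that it does not need a clean modulation spectrum as a reference, yet it produces speech features of comparable robustness. In experiments on the widely used AURORA-2 speech database, different settings of the dynamic power-law exponent allow the original temporal-sequence feature normalization methods to reach higher recognition rates. In addition, when the new method is applied to features pre-processed by various statistical normalization techniques, it gives better recognition performance than the corresponding statistical normalization technique alone, which confirms that the new method combines well with other robustness techniques.

Parallel abstract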


In this thesis, we present a novel approach to enhancing speech features in the modulation spectrum for better recognition performance in noise-corrupted environments. In the presented approach, termed modulation spectrum power-law expansion (MSPLE), the speech feature temporal stream is first pre-processed by a statistics compensation technique, such as cepstral mean and variance normalization (CMVN), cepstral gain normalization (CGN), or cepstral histogram normalization (CHN). The magnitude part of the modulation spectrum (the Fourier transform of the feature stream) is then raised to a power. We find that MSPLE can highlight the speech components and reduce the noise distortion that remains in the statistics-compensated speech features. On the Aurora-2 digit database and task, experimental results show that this process consistently achieves very promising recognition accuracy under a wide range of noise-corrupted environments. MSPLE applied to MVN-preprocessed features yields a relative error rate reduction of about 45% compared with the MFCC baseline and significantly outperforms MVN alone. Furthermore, performing MSPLE on only the lower half-band of the modulation spectrum gives results very close to those obtained when the full-band modulation spectrum is updated by MSPLE, indicating that a less complicated MSPLE suffices to produce noise-robust speech features.
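The processing chain described above can be illustrated with a minimal NumPy sketch. It is not the thesis implementation: the function names `cmvn` and `msple`, the exponent value used in the usage note, and the reading of "low half sub-band" as the lower half of the FFT bins are all assumptions made here for illustration. The sketch normalizes each feature dimension per utterance, takes the Fourier transform along the time axis, raises the magnitude to a power while keeping the phase, and transforms back.

```python
import numpy as np


def cmvn(features):
    """Per-utterance mean and variance normalization.

    `features` has shape (n_frames, n_dims); each column is the temporal
    stream of one feature dimension.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    # Guard against division by zero for constant streams.
    return (features - mean) / np.maximum(std, 1e-8)


def msple(features, exponent, low_half_only=False):
    """Raise the magnitude modulation spectrum to a power (illustrative).

    Assumes `exponent` is positive. When `low_half_only` is True, only the
    lower half of the modulation-frequency bins is modified (an assumed
    reading of the sub-band variant in the abstract).
    """
    n_frames = features.shape[0]
    # Modulation spectrum: Fourier transform along the time (frame) axis.
    spectrum = np.fft.rfft(features, axis=0)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    new_magnitude = magnitude.copy()
    n_bins = magnitude.shape[0]
    upper = n_bins // 2 if low_half_only else n_bins
    new_magnitude[:upper] = magnitude[:upper] ** exponent

    # Recombine the updated magnitude with the original phase.
    new_spectrum = new_magnitude * np.exp(1j * phase)
    return np.fft.irfft(new_spectrum, n=n_frames, axis=0)
```

A hypothetical call would be `msple(cmvn(mfcc), exponent=2.0)`, where `mfcc` is an array of shape `(n_frames, n_dims)` and the exponent value is a placeholder rather than a setting reported in the abstract. Note that no clean reference spectrum enters the computation, which matches the advantage claimed in the abstract.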
