A Study of Modulation Spectrum Power-Law Expansion for Robust Speech Recognition

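In this thesis, we propose a new algorithm for improving the noise robustness of speech recognition. The algorithm rests on the premise that the spectrum of a speech feature sequence is concentrated in the low and middle frequencies. Accordingly, it applies a power-law exponent to the magnitude modulation spectrum of the noise-corrupted speech feature sequence, yielding a speech feature time sequence that is more robust or more discriminative. Compared with other methods that update the magnitude modulation spectrum, the main advantage of the new method is that it obtains speech features of comparable robustness without using a clean modulation spectrum as a reference. In experiments on the internationally used AURORA-2 speech database, different dynamic settings of the power-law exponent allow the original temporal feature normalization methods to reach higher recognition rates. We also apply the new method to features preprocessed by various statistical normalization techniques; in each case it gives better recognition performance than the statistical normalization technique alone, which confirms that the new method combines well with other robustness techniques.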

Keywords

speech recognition; robust speech; modulation spectrum

Parallel Abstract

In this thesis, we present a novel approach to enhancing speech features in the modulation spectrum domain for better recognition performance in noise-corrupted environments. In the presented approach, termed modulation spectrum power-law expansion (MSPLE), the temporal stream of speech features is first pre-processed by a statistics compensation technique, such as cepstral mean and variance normalization (CMVN), cepstral gain normalization (CGN), or cepstral histogram normalization (CHN). The magnitude of the modulation spectrum (the Fourier transform of the feature stream) is then raised to a power. We find that MSPLE highlights the speech components and reduces the noise distortion remaining in the statistics-compensated speech features. Experiments on the Aurora-2 digit database and task show that this process consistently achieves promising recognition accuracy under a wide range of noise-corrupted environments. MSPLE applied to CMVN-preprocessed features yields a relative error rate reduction of about 45% over the MFCC baseline and significantly outperforms CMVN alone. Furthermore, applying MSPLE only to the lower half of the modulation spectrum gives results very close to those of full-band MSPLE, indicating that a less complicated form of MSPLE suffices to produce noise-robust speech features.
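The processing chain described above (statistics compensation, then a power applied to the modulation-spectrum magnitude) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names `cmvn` and `msple`, the `exponent` and `low_half_only` parameters, and the choice to keep the original phase and apply no rescaling afterwards are all assumptions, since the abstract specifies none of these details. The reading of "low-half sub-band" as the lower half of the one-sided FFT bins is likewise an assumption.

```python
import numpy as np


def cmvn(features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-utterance mean and variance normalization along the time axis.

    features: array of shape (num_frames, num_coeffs), e.g. MFCCs.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)


def msple(features: np.ndarray, exponent: float,
          low_half_only: bool = False) -> np.ndarray:
    """Raise the modulation-spectrum magnitude of each feature stream to a power.

    Assumptions (not stated in the abstract): the phase is kept unchanged,
    the stream is rebuilt with an inverse FFT, and no rescaling is applied.
    """
    num_frames = features.shape[0]

    # Modulation spectrum: FFT of each coefficient's trajectory over time.
    spectrum = np.fft.rfft(features, axis=0)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    new_magnitude = magnitude ** exponent
    if low_half_only:
        # Assumption: leave the upper half of the one-sided bins untouched.
        cutoff = spectrum.shape[0] // 2
        new_magnitude[cutoff:] = magnitude[cutoff:]

    new_spectrum = new_magnitude * np.exp(1j * phase)
    return np.fft.irfft(new_spectrum, n=num_frames, axis=0)
```

A call such as `msple(cmvn(mfcc), exponent=1.5)` would process one utterance, where `mfcc` is a hypothetical `(num_frames, num_coeffs)` array and `1.5` is an illustrative value rather than a setting reported in the thesis. With `exponent=1.0` the function returns its input unchanged, which is a convenient sanity check.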

Parallel Keywords

speech recognition; robust speech features; modulation spectrum
