
Artificial Neural Network Incorporating Regional Information Training for Robust Speech Recognition

Advisors: 王家慶 (Jia-Ching Wang), 曹昱 (Yu Tsao)

Abstract


Speech is an indispensable element of human society. As technology advances, people rely on computers to handle an ever greater share of everyday tasks, so enabling computers to process speech data has made speech recognition an important research topic. Current speech recognition technology achieves good results on clean digit speech, but the environments we actually live in are full of noise unrelated to what we want to recognize; as the signal-to-noise ratio (SNR) drops, the recognition rate inevitably drops with it. Finding ways to improve speech recognition in noisy environments is therefore very important for real-world applications.

In recent years, research on neural networks for speech recognition has produced rich results, effectively reducing the impact of environment and speaker variability on the speech signal and substantially improving recognition rates, yet the system's recognition capability still has room for improvement. This thesis proposes a new automatic speech recognition architecture that combines environment clustering (EC), mixture of experts, and neural networks to further improve system performance. The recognition system has two phases, offline and online. In the offline phase, the entire training set is partitioned into several subsets according to their acoustic characteristics, and a neural network (referred to as a sub-network) is trained on each subset. In the online phase, a GMM gate controls the outputs of the sub-networks. The proposed architecture preserves the acoustic characteristics of each training subset, making the recognition system more robust.

Experimentally, we use the Aurora 2 continuous-digit speech corpus and compare the proposed architecture against a conventional neural-network-based recognizer in terms of word error rate (WER). The average WER improves by a relative 6.86%, from 5.25% down to 4.89%.
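The reported 6.86% figure is a relative WER reduction, not an absolute one; a quick arithmetic check using the two WER values quoted in the abstract:

```python
# Relative WER reduction from the baseline system (5.25% WER) to the
# proposed EC-MOE system (4.89% WER), as reported in the abstract.
baseline_wer = 5.25
proposed_wer = 4.89
relative_reduction = (baseline_wer - proposed_wer) / baseline_wer * 100
print(round(relative_reduction, 2))  # → 6.86
```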

Parallel Abstract (English)


Speech is an essential element of human society. With the advance of science and technology, people rely on computers to handle more and more of their daily lives, so making computers capable of handling speech data has turned speech recognition into an important issue. Automatic speech recognition (ASR) achieves good results on clean speech, but the environments we live in are full of noise, and as the SNR of the speech gets lower, recognition accuracy inevitably decreases. For this reason, finding ways to improve noisy-speech recognition is important in practice. Recently, ASR using neural network (NN) based acoustic models (AMs) has achieved significant improvements. However, the mismatch (including speaker and speaking environment) between training and testing conditions still confines the applicability of ASR. This thesis proposes a novel approach that combines the environment clustering (EC) and mixture of experts (MOE) algorithms (thus the proposed approach is termed EC-MOE) to enhance the robustness of ASR against mismatches. In the offline phase, we split the entire training set into several subsets, each characterizing a specific speaker and speaking environment, and use each subset of training data to prepare an NN-based AM. In the online phase, we use a Gaussian mixture model (GMM) gate to determine the optimal output from the multiple NN-based AMs and render the final recognition results. We evaluated the proposed EC-MOE approach on the Aurora 2 continuous-digit speech recognition task. Compared to the baseline system, where only a single NN-based AM is used for recognition, the proposed approach achieves a clear relative word error rate (WER) reduction of 6.86% (from 5.25% to 4.89%).
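The online phase described above can be sketched in a few lines: a GMM per environment cluster scores the incoming frame, the scores are softmax-normalized into gate weights, and the gate-weighted combination of the experts' state posteriors is emitted. This is a minimal illustrative sketch only; the cluster GMMs, expert functions, and scalar features below are hypothetical stand-ins, whereas in the thesis each expert is an NN-based acoustic model trained on one environment cluster of the training data.

```python
import math

def gmm_loglik(x, gmm):
    """Log-likelihood of a scalar feature under a 1-D GMM.
    gmm is a list of (weight, mean, variance) components."""
    p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in gmm)
    return math.log(p)

def ec_moe_posterior(frame, experts, gate_gmms):
    """Gate the experts' state posteriors by per-cluster GMM scores."""
    # Softmax over the per-cluster GMM log-likelihoods gives gate weights.
    ll = [gmm_loglik(frame, g) for g in gate_gmms]
    mx = max(ll)
    gates = [math.exp(l - mx) for l in ll]
    z = sum(gates)
    gates = [g / z for g in gates]
    # Gate-weighted combination of each expert's state posteriors.
    outputs = [expert(frame) for expert in experts]
    n_states = len(outputs[0])
    return [sum(g * out[s] for g, out in zip(gates, outputs))
            for s in range(n_states)]

# Toy example: two clusters ("clean" vs. "noisy"), three HMM states.
clean_gmm = [(1.0, 0.0, 1.0)]   # single Gaussian centered at 0
noisy_gmm = [(1.0, 5.0, 1.0)]   # single Gaussian centered at 5

def expert_clean(x):
    return [0.7, 0.2, 0.1]      # stand-in for an NN-based AM's posteriors

def expert_noisy(x):
    return [0.1, 0.2, 0.7]

post = ec_moe_posterior(0.1, [expert_clean, expert_noisy],
                        [clean_gmm, noisy_gmm])
print([round(p, 3) for p in post])  # dominated by the "clean" expert
```

A frame near 0 scores far higher under the clean-cluster GMM, so its gate weight approaches 1 and the combined posterior is essentially the clean expert's output, which matches the intent of letting the gate select the acoustically best-matched sub-network.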

