Using Dictionary Learning for Robust Speech Recognition

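The performance of automatic speech recognition (ASR) systems often degrades markedly in noisy environments. This paper studies speech robustness techniques, aiming to extract more robust speech features by normalizing the modulation spectrum of those features. To this end, we apply the K-SVD dictionary learning method to decompose the magnitude component of the modulation spectrum, minimizing the reconstruction error subject to a sparsity constraint on the weight matrix. In addition, because the magnitude component of the modulation spectrum is non-negative, we propose a nonnegative K-SVD method that takes this property into account, with the goal of improving the noise robustness of ASR systems. All experiments were conducted on the widely used Aurora-2 connected-digit corpus. The results show that, compared with a baseline using only Mel-frequency cepstral coefficients (MFCC) and with other common modulation spectrum decomposition methods, both the proposed dictionary learning method and its improved variant significantly reduce the speech recognition error rate. Finally, we also combine the proposed dictionary learning methods with several classic robustness techniques, such as the Advanced Front-End (AFE), Cepstral Mean and Variance Normalization (CMVN), and Histogram Equalization (HEQ), to verify their practical utility.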
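For orientation, the sparsity-constrained decomposition described above is commonly written as the standard K-SVD objective. The symbols here are assumptions introduced for illustration and do not appear in the abstract: the columns of $\mathbf{Y}$ are magnitude modulation spectra of clean training features, $\mathbf{D}$ is the dictionary, $\mathbf{X}$ is the weight matrix with columns $\mathbf{x}_i$, and $T_0$ is the maximum number of non-zero weights per column.

```latex
\min_{\mathbf{D},\,\mathbf{X}} \; \lVert \mathbf{Y} - \mathbf{D}\mathbf{X} \rVert_F^2
\quad \text{subject to} \quad \lVert \mathbf{x}_i \rVert_0 \le T_0 \;\; \text{for all } i
```

A nonnegative variant would presumably add the elementwise constraints $\mathbf{D} \ge 0$ and $\mathbf{X} \ge 0$, which matches the non-negativity of the magnitude spectra.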

Keywords

Robustness; Automatic Speech Recognition; Modulation Spectrum; Sparse Coding; Dictionary Learning

Parallel Abstract

The performance of automatic speech recognition (ASR) often degrades dramatically in noisy environments. In this paper, we present a novel use of dictionary learning to normalize the magnitude modulation spectra of speech features so as to retain more noise-resistant and important acoustic characteristics. To this end, we employ the K-SVD method to create sparse representations over a common set of basis vectors that span the intrinsic temporal structure inherent in the modulation spectra of clean training speech features. In addition, taking into account the non-negativity of the magnitude modulation spectrum, we use the nonnegative K-SVD method, paired with nonnegative sparse coding, to capture more noise-robust features. All experiments were conducted on the Aurora-2 corpus and task. The empirical evidence shows that our methods can offer substantial improvements over the baseline NMF method. Finally, we also integrate the proposed variants of the K-SVD method with other well-known robustness methods, such as Advanced Front-End (AFE), Cepstral Mean and Variance Normalization (CMVN), and Histogram Equalization (HEQ), to further confirm their utility.
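The following is a minimal NumPy sketch of the general idea, not the authors' implementation. It assumes that each feature trajectory is transformed with a real FFT, that a dictionary is learned with plain K-SVD on the magnitude spectra of clean trajectories, and that a noisy magnitude spectrum is replaced by its sparse reconstruction while the original phase is kept. The function names (`omp`, `ksvd`, `normalize_trajectory`), the synthetic data, and all parameter values are hypothetical.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y with at most k atoms of D."""
    residual = y.copy()
    support = []
    x = np.zeros(D.shape[1])
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Refit all selected atoms jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    if support:
        x[support] = coef
    return x

def ksvd(Y, n_atoms, k, n_iter=5, seed=0):
    """Plain K-SVD: alternate sparse coding (OMP) and rank-1 atom updates."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((Y.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0, keepdims=True)
    for _ in range(n_iter):
        X = np.column_stack([omp(D, y, k) for y in Y.T])
        for j in range(n_atoms):
            users = np.flatnonzero(X[j])
            if users.size == 0:
                continue  # atom unused in this pass; leave it unchanged
            # Error without atom j, restricted to the signals that use it.
            E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]
            X[j, users] = s[0] * Vt[0]
    return D

def normalize_trajectory(traj, D, k):
    """Replace the magnitude modulation spectrum by its sparse reconstruction."""
    spec = np.fft.rfft(traj)
    mag, phase = np.abs(spec), np.angle(spec)
    # Assumption: clip at zero, since plain K-SVD does not enforce non-negativity.
    new_mag = np.maximum(D @ omp(D, mag, k), 0.0)
    return np.fft.irfft(new_mag * np.exp(1j * phase), n=len(traj))

# Synthetic stand-in for clean feature trajectories (200 trajectories, 64 frames each).
rng = np.random.default_rng(1)
clean = np.cumsum(rng.standard_normal((200, 64)), axis=1)
Y = np.abs(np.fft.rfft(clean, axis=1)).T        # columns are magnitude spectra
D = ksvd(Y, n_atoms=16, k=4)
noisy = clean[0] + rng.standard_normal(64)      # synthetic noisy trajectory
out = normalize_trajectory(noisy, D, k=4)
```

A nonnegative variant would constrain both the dictionary and the weights to be non-negative instead of clipping after reconstruction. That step is not shown here because the abstract does not give its details.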

Parallel Keywords

Robustness; Automatic Speech Recognition; Modulation Spectrum; Sparse Coding; Dictionary Learning
