A Study of Modulation Spectrum Feature Normalization for Robust Speech Recognition

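Robustness has long been an important research topic in the development of automatic speech recognition. Among the many robustness techniques, enhancing and compensating the speech feature parameters forms one major school. In recent years, a good number of new methods have improved feature robustness by updating the temporal sequences of speech features and their modulation spectra. Most of these techniques normalize the statistical properties of the temporal sequences or of the modulation spectra in order to reduce the mismatch between utterances and thereby make the recognition system more robust. This thesis instead takes a new perspective: aiming at decomposition and component analysis of the modulation spectrum, it proposes two modulation spectrum normalization methods. First, it uses nonnegative matrix factorization (NMF) to extract the important basis vectors of the modulation spectrum and uses them to update the modulation spectrum, yielding more robust speech features. Second, it gives the modulation spectrum a probabilistic interpretation and, adopting the concept of probabilistic latent semantic analysis (PLSA), applies probabilistic component analysis to the modulation spectrum to extract its more important components and obtain more robust speech features. All experiments are conducted on the internationally used Aurora-2 connected-digit database. Compared with the baseline experiment using Mel-frequency cepstral features, the proposed methods all significantly reduce the word error rate. The thesis also combines the proposed methods with several well-known feature robustness techniques; the experiments show that, relative to each individual method, the combinations further improve recognition accuracy, indicating that the proposed methods are complementary to many feature robustness techniques.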
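The NMF-based normalization described above can be illustrated with a minimal sketch. The abstract does not give the thesis's exact procedure, so everything below is an assumption for illustration: all feature sequences share one fixed length, the magnitude modulation spectrum is the DFT magnitude of a single feature dimension over time, the factorization uses the standard multiplicative updates for the Euclidean cost, and the original phase is kept when the updated magnitude is transformed back. The function names (`learn_basis`, `normalize_sequence`, and the helpers) are hypothetical.

```python
import cmath
import math
import random

EPS = 1e-9  # guards against division by zero in the multiplicative updates


def dft(x):
    """Naive O(n^2) DFT of a real feature time series."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X):
    """Inverse DFT; returns the real part (input is conjugate-symmetric)."""
    n = len(X)
    return [(sum(X[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def scale(M, num, den):
    """Element-wise M * num / den, the core of a multiplicative update."""
    return [[m * p / (q + EPS) for m, p, q in zip(mr, pr, qr)]
            for mr, pr, qr in zip(M, num, den)]


def update_H(V, W, H):
    Wt = transpose(W)
    return scale(H, matmul(Wt, V), matmul(Wt, matmul(W, H)))


def update_W(V, W, H):
    Ht = transpose(H)
    return scale(W, matmul(V, Ht), matmul(matmul(W, H), Ht))


def half_magnitude(x):
    """Magnitude modulation spectrum for bins 0..n//2 as a column vector."""
    return [[abs(c)] for c in dft(x)[: len(x) // 2 + 1]]


def learn_basis(train_seqs, rank, iters=200, seed=0):
    """Learn a nonnegative basis W from clean training sequences (V ~= W H)."""
    cols = [half_magnitude(x) for x in train_seqs]
    V = [[col[f][0] for col in cols] for f in range(len(cols[0]))]
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in V]
    H = [[rng.random() + 0.1 for _ in cols] for _ in range(rank)]
    for _ in range(iters):
        H = update_H(V, W, H)
        W = update_W(V, W, H)
    return W


def normalize_sequence(x, W, iters=200):
    """Replace |X| by its nonnegative approximation W h, keeping the phase."""
    X, n = dft(x), len(x)
    v = half_magnitude(x)
    h = [[1.0] for _ in W[0]]
    for _ in range(iters):          # W stays fixed; only the encoding h adapts
        h = update_H(v, W, h)
    Y = [0j] * n
    for k, row in enumerate(matmul(W, h)):
        Y[k] = row[0] * cmath.exp(1j * cmath.phase(X[k]))
        if 0 < k and k != n - k:    # mirror to keep the time series real
            Y[n - k] = Y[k].conjugate()
    return idft_real(Y)
```

In this sketch the basis is learned once from clean training data and then held fixed, so a noisy utterance can only be described in terms of spectral shapes seen in clean speech; that is one plausible reading of "mapping into the space spanned by the basis vectors", not a confirmed description of the thesis's implementation.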

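Keywords

automatic speech recognition; speech robustness; nonnegative matrix factorization; probabilistic latent semantic analysis

Parallel Abstract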

Environmental mismatch caused by additive noise and/or channel distortion often seriously degrades the performance of a speech recognition system. Various robustness methods have therefore been proposed, and one prevalent school of thought aims to refine the modulation spectra of speech feature sequences. In this thesis, we propose two novel methods to normalize the modulation spectra of speech feature sequences. First, we leverage nonnegative matrix factorization (NMF) to extract a common set of basis spectral vectors that capture the intrinsic temporal structure inherent in the modulation spectra of clean training speech features. The new modulation spectra of the speech features, constructed by mapping the original modulation spectra into the space spanned by these basis vectors, are shown to have good noise-robust capabilities. Second, to endow the modulation spectra of speech feature sequences with a probabilistic perspective, we employ probabilistic latent semantic analysis (PLSA) with a latent set of topic distributions to explore the relationship between each modulation frequency and the magnitude modulation spectrum as a whole. All experiments were carried out on the Aurora-2 database and task. Experimental results show that the features updated via NMF and PLSA maintain high recognition accuracy under matched and mismatched noisy conditions, which is quite competitive with the accuracy obtained by other existing methods.
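The probabilistic view can be sketched with generic PLSA. The abstract does not spell out the thesis's formulation, so the following assumes the usual document-word analogy: each utterance's magnitude modulation spectrum plays the role of a document, each modulation frequency bin plays the role of a word, and the magnitude acts as a (real-valued) count, giving P(f|d) = Σ_z P(f|z) P(z|d). The names `plsa`, `fold_in`, and `reconstruct` are hypothetical.

```python
import random

TINY = 1e-12  # avoids division by zero when a bin has no probability mass


def to_distribution(row):
    """Scale nonnegative values so that they sum to one."""
    s = sum(row)
    return [r / s for r in row] if s > 0 else [1.0 / len(row)] * len(row)


def plsa(V, n_topics, iters=100, seed=0):
    """EM for PLSA. V[d][f] is the magnitude of bin f in utterance d."""
    rng = random.Random(seed)
    D, F = len(V), len(V[0])
    p_f_z = [to_distribution([rng.random() + 0.1 for _ in range(F)])
             for _ in range(n_topics)]
    p_z_d = [to_distribution([rng.random() + 0.1 for _ in range(n_topics)])
             for _ in range(D)]
    for _ in range(iters):
        new_fz = [[0.0] * F for _ in range(n_topics)]
        new_zd = [[0.0] * n_topics for _ in range(D)]
        for d in range(D):
            for f in range(F):
                joint = [p_z_d[d][z] * p_f_z[z][f] for z in range(n_topics)]
                total = sum(joint) + TINY
                for z in range(n_topics):
                    # E-step posterior P(z|d,f), weighted by the magnitude
                    r = V[d][f] * joint[z] / total
                    new_fz[z][f] += r
                    new_zd[d][z] += r
        # M-step: renormalize the accumulated expected counts
        p_f_z = [to_distribution(row) for row in new_fz]
        p_z_d = [to_distribution(row) for row in new_zd]
    return p_f_z, p_z_d


def fold_in(v, p_f_z, iters=100):
    """Estimate P(z|d) for a new spectrum v with the topics P(f|z) fixed."""
    Z = len(p_f_z)
    p_z = [1.0 / Z] * Z
    for _ in range(iters):
        acc = [0.0] * Z
        for f, count in enumerate(v):
            joint = [p_z[z] * p_f_z[z][f] for z in range(Z)]
            total = sum(joint) + TINY
            for z in range(Z):
                acc[z] += count * joint[z] / total
        p_z = to_distribution(acc)
    return p_z


def reconstruct(v, p_f_z, p_z):
    """Updated magnitude spectrum: total mass times the modeled P(f|d)."""
    total = sum(v)
    return [total * sum(w * p_f_z[z][f] for z, w in enumerate(p_z))
            for f in range(len(v))]
```

A possible usage pattern, again under the assumptions above, is to train `plsa` on clean spectra, call `fold_in` on each test spectrum, and use `reconstruct` to obtain the updated magnitude spectrum before transforming back to the feature domain.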

Parallel Keywords

speech recognition; robustness method; nonnegative matrix factorization; probabilistic latent semantic analysis
