A Study of Energy-Based Speech Normalization Methods and Their Application to Voice Activity Detection

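This thesis investigates robust speech recognition under various noise conditions and studies, in the time domain, how to reconstruct the log-energy of clean speech from the log-energy of noisy speech. Based on the distribution of the log-energy feature values within each utterance, we aim to develop an effective method for rescaling the log-energy of noisy speech, so as to reduce the mismatch caused by noisy environments and achieve higher recognition accuracy. Observation of speech signals in the time domain shows that low-energy speech frames are more easily affected by additive noise than high-energy frames, and that under severe additive noise the log-energy features are raised across almost the entire utterance. We therefore propose a simple but effective method, called log-energy rescaling normalization (LERN), which appropriately rescales the log-energy feature values of noisy speech so that they approach those of clean speech. The speech recognition experiments use the Aurora-2.0 corpus released by the European Telecommunications Standards Institute (ETSI), which covers a variety of noise environments. The corpus consists of small-vocabulary connected digit strings spoken in English, and provides eight noise sources and seven signal-to-noise ratio (SNR) conditions. The experimental results show that LERN outperforms other energy or log-energy normalization methods. In addition, a further set of experiments on large vocabulary continuous speech recognition (LVCSR) using the Mandarin broadcast news corpus (MATBN) shows that LERN still effectively improves recognition accuracy.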

Keywords

Speech normalization; voice activity detection

Abstract (English)

This thesis considers robust speech recognition in various noise environments, with a special focus on ways to reconstruct clean time-domain log-energy features from noise-contaminated ones. Based on the distribution characteristics of the log-energy features of each speech utterance, we aim to develop an efficient approach that rescales the log-energy features of a noisy utterance, so as to alleviate the mismatch caused by environmental noise and improve speech recognition performance. Time-domain observation of speech signals reveals that lower-energy speech frames are more vulnerable to additive noise than higher-energy ones, and that the magnitudes of the log-energy features tend to be lifted across the utterance when it is severely corrupted by additive noise. We therefore propose a simple but effective approach, named log-energy rescaling normalization (LERN), which appropriately rescales the log-energy features of noisy speech toward those of the desired clean speech. The speech recognition experiments were conducted under various noise conditions using the European Telecommunications Standards Institute (ETSI) Aurora-2.0 database. The database contains a set of connected digit utterances spoken in English, and offers eight noise sources and seven signal-to-noise ratios (SNRs). The experimental results show that the proposed LERN approach performs considerably better than the other conventional energy or log-energy feature normalization methods. Another set of experiments, conducted on large vocabulary continuous speech recognition (LVCSR) of Mandarin broadcast news, also confirms the effectiveness of LERN.
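The idea of rescaling an utterance's log-energy contour can be illustrated with a minimal sketch. The abstract does not give the actual LERN mapping, so the linear min-max rescaling below is only a stand-in chosen for illustration, and the function names, frame sizes, and target range are all hypothetical. The sketch computes per-frame log-energy and then maps a noisy contour onto an assumed target range.

```python
import math


def frame_log_energy(signal, frame_len=200, hop=200, floor=1e-10):
    """Return the natural-log energy of each full frame of a sample sequence.

    frame_len, hop, and floor are illustrative defaults, not values
    taken from the thesis.
    """
    log_energies = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame)
        # Floor the energy so that silent frames do not produce log(0).
        log_energies.append(math.log(max(energy, floor)))
    return log_energies


def rescale_log_energy(log_e, target_min, target_max):
    """Linearly map the contour's own [min, max] onto [target_min, target_max].

    This is a generic min-max rescaling used as a placeholder; it is NOT
    the LERN formula, which the abstract does not specify.
    """
    lo, hi = min(log_e), max(log_e)
    if hi == lo:
        # Flat contour: there is no range to rescale, return a copy.
        return list(log_e)
    scale = (target_max - target_min) / (hi - lo)
    return [target_min + (e - lo) * scale for e in log_e]


# Synthetic example: 400 samples of silence followed by 400 samples of a tone.
clean = [0.0] * 400 + [math.sin(2 * math.pi * 0.05 * n) for n in range(400)]
# Deterministic stand-in for additive noise: a small alternating offset.
noisy = [c + (0.1 if n % 2 == 0 else -0.1) for n, c in enumerate(clean)]

clean_le = frame_log_energy(clean)
noisy_le = frame_log_energy(noisy)
# The target range is assumed known here; in practice it would have to be
# estimated, for example from clean training data.
rescaled = rescale_log_energy(noisy_le, min(clean_le), max(clean_le))
```

In this toy signal the additive offset raises the log-energy of the silent frames far more than that of the tone frames, which matches the observation that low-energy frames are the more vulnerable ones. The rescaling step then pulls the lifted floor back down toward the assumed clean range.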

Keywords (English)

Speech Feature Normalization; Voice Activity Detection

References

[Chen et al. 2004] Berlin Chen, Jen-Wei Kuo, Wen-Hung Tsai, “Lightly Supervised and Data-Driven Approaches to Mandarin Broadcast News Transcription,” in Proc. ICASSP 2004.

[Chen et al. 2005] Berlin Chen, Jen-Wei Kuo, Wen-Huang Tsai, “Lightly Supervised and Data-Driven Approaches to Mandarin Broadcast News Transcription,” International Journal of Computational Linguistics and Chinese Language Processing, Vol. 10, No. 1, pp. 1-18, March 2005.

[Tai and Hung 2006] Chung-fu Tai and Jeih-weih Hung, “Silence Energy Normalization for Robust Speech Recognition in Additive Noise Environments,” in Proc. ICSLP 2006.

[Wang et al. 2005] Hsin-min Wang, Berlin Chen, Jen-Wei Kuo, and Shih-Sian Cheng, “MATBN: A Mandarin Chinese Broadcast News Corpus,” International Journal of Computational Linguistics and Chinese Language Processing, Vol. 10, No. 2, pp. 219-236, June 2005.

[Aubert 2002] X. L. Aubert, “An Overview of Decoding Techniques for Large Vocabulary Continuous Speech Recognition,” Computer Speech and Language, January 2002.

Cited By

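張云箐 (2007). Analysis of the Least Mean Square Algorithm and Power Spectral Density Difference Values for Noise Cancellation [Master's thesis, National Taipei University of Technology]. Airiti Library. https://doi.org/10.6841/NTUT.2007.00019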
