強健性語音辨識中分頻帶調變頻譜補償之研究

The Study of Sub-band Modulation Spectrum Compensation for Robust Speech Recognition

Advisor: 洪志偉

Abstract


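Although speech technology has advanced rapidly, automatic speech recognition remains a topic that merits further research and development. Most current speech recognition systems are deployed in quiet, interference-free environments, where they achieve quite satisfactory results; in real environments, however, the speech signal is often corrupted by environmental noise, and recognition performance degrades markedly. Robustness techniques, developed over many years, aim to remedy this weakness. Among them, one class of methods applies statistical normalization to speech features. Traditionally, these methods normalize the full-band temporal sequence of the speech features, yet their effectiveness is usually judged by how well the modulation spectrum is normalized. Normalizing the modulation spectrum of the speech features directly should therefore also give good results. In addition, modulation frequency components at different frequencies are not equally important, a property that conventional temporal-sequence normalization largely ignores. Based on these observations, this thesis proposes a series of sub-band modulation spectrum statistics normalization methods, which normalize the statistical characteristics of different bands separately and thereby improve the robustness of speech features in noisy environments. In recognition experiments on the internationally used Aurora-2 connected-digit database, the proposed methods achieve a relative error rate reduction of up to 65% with respect to the baseline, and they improve on temporal-sequence normalization by 7% to 32% in relative error rate reduction. These results confirm that the new methods more effectively improve the performance of speech recognition systems in noisy environments.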

Parallel Abstract


In this thesis, we propose a novel scheme for applying feature statistics normalization techniques to robust speech recognition. In the proposed approach, the temporal-domain feature sequence is first converted into the modulation spectral domain. The magnitude part of the modulation spectrum is decomposed into non-uniform sub-band segments, and each sub-band segment is then processed individually by a well-known normalization method, such as mean normalization (MN), mean and variance normalization (MVN), or histogram equalization (HEQ). Finally, we reconstruct the feature stream from the modified sub-band magnitude spectral segments and the original phase spectrum using the inverse DFT. With this process, the components of the feature sequence that correspond to more important modulation spectral bands can be processed separately. For the Aurora-2 clean-condition training task, the newly proposed sub-band spectral MN, MVN and HEQ provide relative error rate reductions of 18.66% and 23.58% over the conventional temporal MVN and HEQ, respectively.
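
The processing chain described in the abstract (DFT of a feature trajectory, per-band normalization of the magnitude, reconstruction with the original phase) can be illustrated with a minimal NumPy sketch of the MVN variant. This is not the thesis implementation: the function name `subband_spectral_mvn`, the band edges, the per-band reference statistics, and the clamping of negative magnitudes to zero are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def subband_spectral_mvn(x, band_edges, ref_mean, ref_std, eps=1e-12):
    """Sketch of sub-band mean and variance normalization on a modulation spectrum.

    x          : 1-D temporal sequence of one feature dimension
                 (e.g. one cepstral coefficient over all frames).
    band_edges : increasing DFT-bin indices delimiting the sub-bands
                 (hypothetical; the abstract only says "non-uniform").
    ref_mean, ref_std : per-band target statistics (assumed here to be
                 supplied by the caller, e.g. estimated from clean data).
    """
    x = np.asarray(x, dtype=float)

    # One-sided DFT of the real feature trajectory: its modulation spectrum.
    spectrum = np.fft.rfft(x)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)

    new_magnitude = magnitude.copy()
    for b in range(len(band_edges) - 1):
        lo, hi = band_edges[b], band_edges[b + 1]
        segment = magnitude[lo:hi]
        if segment.size == 0:
            continue
        # Standardize the band, then rescale to the reference statistics.
        z = (segment - segment.mean()) / (segment.std() + eps)
        # Assumption: a magnitude cannot be negative, so clamp at zero.
        new_magnitude[lo:hi] = np.maximum(z * ref_std[b] + ref_mean[b], 0.0)

    # Recombine the modified magnitude with the original phase and invert.
    return np.fft.irfft(new_magnitude * np.exp(1j * phase), n=x.size)
```

A call such as `subband_spectral_mvn(x, [0, 2, 5, 12, 33], means, stds)` on a 64-frame trajectory would use four sub-bands that get wider toward higher modulation frequencies; bins outside the listed edges are left unchanged. Swapping the standardization step for mean subtraction or a histogram mapping would give the MN and HEQ variants under the same assumptions.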
