
降低音框率語音在分散式語音辨認之研究

Recognition of Reduce Frame Rate Speech Data for Distributed Speech Recognition

Advisors: 簡福榮, 李立民

Abstract


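In distributed speech recognition (DSR), feature parameters are extracted at the client and transmitted over a communication channel to the server, where recognition is performed. For transmission of reduced frame rate (RFR) speech features, this thesis proposes a feature interpolation reconstruction method and a model adaptation method to compensate for the mismatch between the received RFR feature sequence and the hidden Markov models (HMMs) trained on full frame rate (FFR) speech. The model adaptation method is further divided into model adaptation and state number adaptation. Under the quantization framework of the DSR advanced front-end proposed by the European Telecommunications Standards Institute (ETSI), and using the Aurora2 database, the client transmits the speech feature sequence to the server at half, 1/3, and 1/4 frame rate, which reduces channel transmission by 50% to 75%. Sentence recognition rate and recognition processing time serve as the performance measures. Experimental results show that, compared with the FFR sentence recognition rate of 74.95%, feature interpolation reconstruction lowers the sentence recognition rate by only 0.87% to 4.34% but saves no recognition computation. Model adaptation likewise lowers the recognition rate by only 0.85% to 4.34%, yet saves 47.44% to 68.67% of the server-side recognition computation as the frame rate decreases. State number adaptation saves 69.60% to 86.71% of the recognition computation but suffers the largest drop in sentence recognition rate. It is suitable only at half frame rate with advanced front-end quantization, where the sentence recognition rate drops by 2.93%.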
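The relation between the transmitted frame rate and the 50% to 75% channel reduction quoted above can be checked with a small sketch. It assumes the client simply keeps one frame out of every `factor` frames, starting with the first. The abstract does not state which frames are kept, and the function names and toy feature values here are hypothetical.

```python
def decimate_frames(frames, factor):
    """Keep every `factor`-th frame, starting with frame 0 (assumed scheme)."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return frames[::factor]


def channel_saving(factor):
    """Fraction of frames not transmitted when 1 of every `factor` is kept."""
    return 1.0 - 1.0 / factor


# Toy full frame rate sequence: 12 frames of 2-dimensional features
# (hypothetical values, not taken from Aurora2).
ffr = [[float(t), float(2 * t)] for t in range(12)]

for factor in (2, 3, 4):
    rfr = decimate_frames(ffr, factor)
    print(factor, len(rfr), f"{channel_saving(factor):.1%}")
```

Under this assumption, half frame rate drops 50% of the frames, 1/3 frame rate about 66.7%, and 1/4 frame rate 75%, which matches the range given in the abstract.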

Parallel Abstract


With advances in technology, speech recognition has reached a significant level of accuracy. Most speech recognition systems are based on hidden Markov models (HMMs), and a convenient toolkit, HTK (Hidden Markov Model Toolkit), is now available for building and manipulating them. Using HTK, this thesis builds a distributed speech recognition (DSR) system for software simulation. We investigate and compare recognition rates on feature streams with a down-sampled frame rate, such as a half frame rate stream. To compensate for the mismatch between the received speech data and HMMs trained on full frame rate speech data, two compensation approaches are investigated. The first, called feature reconstruction or feature interpolation, uses linear interpolation to recover the missing frames. The second adapts the HMMs and is further divided into model adaptation and state number adaptation. In the experiments, recognition rate and computational cost in CPU time are used as the criteria for comparing performance. Of the three methods, feature interpolation achieves the best recognition rate but carries the largest computational burden. Model adaptation incurs only a small degradation in recognition rate. State number adaptation has the worst recognition rate and the lowest computational cost.
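As a rough illustration of the feature interpolation idea, the sketch below rebuilds a full frame rate sequence from received frames by linear interpolation. It is only a sketch under stated assumptions: the received frames are taken to sit at indices 0, `factor`, 2 × `factor`, and so on, and frames after the last received one are filled by holding its value. Neither detail is specified in the abstract, and the function name is hypothetical.

```python
def interpolate_frames(rfr, factor, total):
    """Rebuild `total` frames from frames received at indices 0, factor, 2*factor, ..."""
    if not rfr:
        raise ValueError("need at least one received frame")
    out = []
    last = len(rfr) - 1
    for t in range(total):
        k, r = divmod(t, factor)
        if k >= last:
            # At or past the last received frame: hold its value (an assumption).
            out.append(list(rfr[last]))
            continue
        w = r / factor  # weight given to the right-hand received frame
        left, right = rfr[k], rfr[k + 1]
        out.append([(1 - w) * a + w * b for a, b in zip(left, right)])
    return out


# Three frames received at half frame rate (hypothetical 2-dimensional features).
received = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0]]
restored = interpolate_frames(received, factor=2, total=6)
print(restored)
```

Because the server then decodes the restored sequence at the full frame rate, this approach would not be expected to reduce recognition time, which is consistent with the computational burden reported for it above.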

