
基於高階次方與動差之廣義特徵正規化法之強健性語音辨識技術

Generalized Cepstral Normalization with Higher Power/Moment Order for Robust Speech Recognition

Advisor: 李琳山 (Lin-shan Lee)

Abstract


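In robust speech recognition, cepstral normalization is a common and quite effective technique. Well-established methods include Cepstral Mean Subtraction (CMS) and Cepstral Mean and Variance Normalization (CMVN), which normalize the first moment, or the first and second moments, of the Mel-frequency Cepstral Coefficients (MFCCs). This thesis proposes a family of generalized cepstral normalization methods that extend the original low-order moment normalization to higher-order moment normalization or to normalization in a higher-power domain.

The proposed Higher Order Cepstral Moment Normalization (HOCMN) extends conventional low-order moment normalization directly to higher-order moments. The magnitude of a higher-order moment is easily influenced by extreme values, and these extreme values are often where statistical asymmetry or abnormal flatness is most severe. Normalizing with respect to these parts therefore adjusts the statistical distribution toward appropriate symmetry and flatness, compensating for the damage caused by noise.

Chapter 5 further proposes Powered Cepstral Normalization (P-CN), which performs cepstral normalization in a higher-power domain. In that domain, the components of environmental noise disturbance are emphasized and can then be removed by normalization. The normalized signal is finally transformed back to a suitable lower-power domain for recognition.

These two methods can be used independently or integrated into a single generalized cepstral normalization method. Experiments on the English continuous-digit corpus (AURORA 2) show a clear improvement over conventional low-order moment normalization, with consistent conclusions across noise environments of different types and levels. Beyond recognition accuracy, we also use statistical analysis of the speech features to examine the basic principles underlying generalized cepstral normalization.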
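As a point of reference for the two baseline methods named in the abstract, the following is a minimal sketch of CMS and CMVN in plain Python. The function name `cepstral_normalize`, the `normalize_variance` flag, and the choice to compute statistics over a single utterance are assumptions made for illustration; the abstract does not specify an implementation.

```python
import math


def cepstral_normalize(frames, normalize_variance=True):
    """Normalize each cepstral dimension over one utterance.

    frames: list of equal-length lists, one MFCC vector per frame.
    normalize_variance=False gives CMS (first moment only);
    normalize_variance=True gives CMVN (first and second moments).
    """
    n = len(frames)
    dims = len(frames[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        column = [frame[d] for frame in frames]
        mean = sum(column) / n
        if normalize_variance:
            var = sum((x - mean) ** 2 for x in column) / n
            # Guard against a constant dimension (zero variance).
            std = math.sqrt(var) if var > 0 else 1.0
        else:
            std = 1.0
        for t in range(n):
            out[t][d] = (column[t] - mean) / std
    return out


# Toy "utterance": three frames with two cepstral coefficients each.
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
cms_features = cepstral_normalize(utterance, normalize_variance=False)
cmvn_features = cepstral_normalize(utterance, normalize_variance=True)
```

After CMVN, every dimension has zero mean and unit variance, but moments of order three and above are left unconstrained, which is the gap the higher-order methods in this thesis address.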

Parallel Abstract


Cepstral normalization is widely used as a powerful approach to producing robust features for speech recognition. Well-known examples include Cepstral Mean Subtraction (CMS) and Cepstral Mean and Variance Normalization (CMVN), in which either the first moment, or both the first and second moments, of the Mel-frequency Cepstral Coefficients (MFCCs) are normalized. In this dissertation, we propose a family of generalized cepstral normalization techniques with higher power/moment order, based on two closely related approaches.

The first approach normalizes the MFCC parameters with respect to a few moments of higher order, i.e., orders higher than 1 or 2. The basic idea is that higher-order moments are dominated more strongly by samples with larger values, which are very likely the primary sources of the asymmetry and abnormal flatness or tail size of the parameter distributions. Normalization with respect to these moments therefore puts more emphasis on these signal components and constrains the distributions to be more symmetric, with more reasonable flatness and tail size. This is referred to as Higher Order Cepstral Moment Normalization (HOCMN) in this dissertation.

The second approach, Powered Cepstral Normalization (P-CN), normalizes the MFCC parameters in the r1-th powered domain, where r1 > 1.0. The basic idea is that when the MFCC parameters are raised to a higher-order power, the r1-th power, the harmful parts of environmental disturbances may be emphasized more than the speech features, which are relatively smooth. Performing the normalization in the domain of a higher-order power may therefore be more helpful. The features are then transformed back by a 1/r2 power to a recognition domain in which the acoustic events can be better distinguished.
The unified formulation of generalized cepstral normalization with higher power/moment order presented in this dissertation reduces to either HOCMN or P-CN as described above, or integrates both. Experimental results on the AURORA 2.0 testing environment show that the proposed approaches improve recognition accuracy significantly and consistently for all types of noise and all SNR conditions. The fundamental principles behind the proposed approaches are also analyzed and discussed on the basis of the statistical properties of the distributions of the MFCC parameters.
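The three P-CN steps described above (raise to the r1-th power, normalize, transform back by a 1/r2 power) can be illustrated with a rough sketch. This is not the dissertation's formulation. The abstract does not say how negative cepstral values are treated under a fractional power, nor which statistics are normalized in the powered domain. The sketch assumes a sign-preserving power and plain mean and variance normalization, and the names `signed_power` and `powered_cepstral_normalize` and the default values of `r1` and `r2` are hypothetical.

```python
import math


def signed_power(x, p):
    """Raise |x| to the power p and keep the sign of x (an assumption,
    used here so that negative cepstral values stay real-valued)."""
    return math.copysign(abs(x) ** p, x)


def powered_cepstral_normalize(column, r1=2.0, r2=2.0):
    """Sketch of P-CN for one cepstral dimension of one utterance.

    column: list of floats, the values of one MFCC dimension over time.
    r1: power of the normalization domain (the abstract requires r1 > 1.0).
    r2: the features are mapped back with the power 1/r2.
    """
    # Step 1: move to the r1-th powered domain.
    powered = [signed_power(x, r1) for x in column]

    # Step 2: normalize there (mean/variance normalization is assumed).
    n = len(powered)
    mean = sum(powered) / n
    var = sum((v - mean) ** 2 for v in powered) / n
    std = math.sqrt(var) if var > 0 else 1.0
    normalized = [(v - mean) / std for v in powered]

    # Step 3: transform back by a 1/r2 power for recognition.
    return [signed_power(v, 1.0 / r2) for v in normalized]


features = powered_cepstral_normalize([1.0, 2.0, 3.0], r1=2.0, r2=2.0)
```

With r1 = r2 = 1.0 the sketch reduces to ordinary CMVN on that dimension, which matches the abstract's framing of P-CN as a generalization of the low-order methods.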

