
最小音素錯誤模型及特徵訓練法於中文大詞彙辨識上之初步研究

Minimum Phone Error Training of Acoustic Models and Features for Large Vocabulary Mandarin Speech Recognition

Advisor: 李琳山 (Lin-shan Lee)

Abstract


Traditional acoustic model training uses Maximum Likelihood Estimation (MLE) to estimate the parameters of hidden Markov models. This maximizes the likelihood of the correct transcriptions over the training corpus, but it cannot effectively pull apart easily confusable models. Discriminative training adjusts the models by considering the recognition results together with the correct transcriptions, attempting to separate confusable models in the high-dimensional feature space. Centered on Minimum Phone Error (MPE) model training and feature-space Minimum Phone Error (fMPE) training, this thesis presents the background, theoretical foundations, and experimental results of discriminative training in detail. The thesis consists of four parts.

The first part covers the theory required for discriminative training, namely risk estimation and auxiliary functions. Starting from the Minimum Bayesian Risk, several widely studied acoustic model training criteria are introduced, including Maximum Likelihood Estimation, Maximum Mutual Information (MMI), Overall Risk Criterion Estimation (ORCE), and Minimum Phone Error (MPE); the objective functions of all of these can be viewed as extensions or variants of the Bayesian risk. The thesis also reviews the mathematical concepts of auxiliary functions used later in deriving the MPE objective, including strong-sense auxiliary functions, weak-sense auxiliary functions, and smoothing functions. Strong-sense and weak-sense auxiliary functions let the original objective be driven iteratively toward a local optimum; when solving with a weak-sense auxiliary function, adding an appropriate smoothing function improves the convergence speed.

The second part describes the experimental framework, including the NTNU broadcast news corpus and the construction of the lexicon and the language model. The recognizer performs large vocabulary continuous speech recognition with a left-to-right, frame-synchronous tree copy search. Maximum-likelihood-trained models, built both on Mel Frequency Cepstral Coefficients (MFCC) and on features processed by Heteroscedastic Linear Discriminant Analysis (HLDA), serve as the baseline experiments.

The third part presents Minimum Phone Error training, which directly takes the expected phone accuracy over the training corpus as its objective function. Replacing the objective with strong-sense and weak-sense auxiliary functions and adding a smoothing function yields the parameter update formulas, which show that the updated models move closer to the features that led to correct recognition, i.e., those belonging to the numerator lattices, while moving away from the features that led to recognition errors, i.e., those belonging to the denominator lattices. The I-smoothing technique introduces a prior distribution over the parameters being estimated (based on their maximum likelihood estimates) to improve the estimation. The thesis also describes the approximation of phone accuracy: the full set of recognition results is approximated by a word lattice, and the word graph forward-backward algorithm computes, for a given word arc, the average accuracy over all its preceding and following paths, yielding the average phone accuracy of all word strings passing through that arc. Experimental results show that this method reduces the character error rate by about 3% on the PTS broadcast news corpus.

The fourth part presents feature-space Minimum Phone Error (fMPE) training. This method first projects each feature vector into a high-dimensional space and then maps it back down to a low-dimensional offset vector; adding this offset to the original feature produces the discriminative effect. The high-to-low-dimensional transformation matrix is updated by gradient descent on the derivative of the MPE objective with respect to the features. This derivative splits into a direct differential and an indirect differential; the indirect differential reflects changes in the models back onto the features, so using both allows training to alternate between updating the models and updating the features. Offset fMPE differs from fMPE in how the high-dimensional vector is computed, and achieves a similar improvement with about one quarter of the computation. The thesis further proposes dimension-weighted offset fMPE, which weights each dimension of the high-dimensional vector according to its importance. Experimental results show that all three methods reduce the character error rate by about 3% on the PTS corpus; dimension-weighted offset fMPE improves over offset fMPE by a further 0.4% and is more robust during training.
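For concreteness, the MPE objective sketched in the third part above can be written in its standard form (this is the usual formulation from the discriminative training literature, not a formula quoted from the thesis; κ is the customary acoustic scale factor, O_r the observation sequence of training utterance r, s_r its reference transcription, and A(s, s_r) the raw phone accuracy of hypothesis s against that reference):

    F_{\mathrm{MPE}}(\lambda) \;=\; \sum_{r=1}^{R} \frac{\sum_{s} p_{\lambda}(O_r \mid s)^{\kappa}\, P(s)\, A(s, s_r)}{\sum_{s'} p_{\lambda}(O_r \mid s')^{\kappa}\, P(s')}

Each utterance contributes the posterior-weighted average phone accuracy of its hypotheses, so raising F_MPE(λ) shifts posterior mass toward high-accuracy word strings and away from competing ones, which is exactly the numerator/denominator push-and-pull described above.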

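The per-arc accuracy computation mentioned in the third part also admits a compact sketch. The following is a minimal illustration under assumed conventions, not the thesis's implementation: the lattice format (topologically sorted arcs with per-arc likelihoods, phone accuracies, and predecessor/successor index lists) is hypothetical, and probability mass is left unnormalized for brevity.

import numpy as np

# Sketch of the word-graph forward-backward pass described above: for each
# word arc q, combine the average accuracy of its preceding paths, the arc's
# own phone accuracy, and the average accuracy of its following paths to get
# the average phone accuracy of all word strings through q.
# Hypothetical lattice format: arcs 0..n-1 in topological order, with
# per-arc likelihoods, phone accuracies, and predecessor/successor lists.

def arc_average_accuracy(likelihood, acc, preds, succs):
    likelihood = np.asarray(likelihood, dtype=float)
    acc = np.asarray(acc, dtype=float)
    n = len(likelihood)
    alpha = np.zeros(n)   # forward probability mass reaching each arc
    fwd = np.zeros(n)     # average accuracy of preceding paths
    for q in range(n):    # topological order
        if not preds[q]:
            alpha[q] = likelihood[q]
        else:
            mass = np.array([alpha[p] for p in preds[q]])
            alpha[q] = mass.sum() * likelihood[q]
            # accuracy accumulated along each incoming path, weighted by
            # the probability mass arriving through that predecessor
            fwd[q] = np.average([fwd[p] + acc[p] for p in preds[q]], weights=mass)
    beta = np.zeros(n)    # backward probability mass
    bwd = np.zeros(n)     # average accuracy of following paths
    for q in reversed(range(n)):
        if not succs[q]:
            beta[q] = 1.0
        else:
            mass = np.array([beta[s] * likelihood[s] for s in succs[q]])
            beta[q] = mass.sum()
            bwd[q] = np.average([bwd[s] + acc[s] for s in succs[q]], weights=mass)
    # c(q): average phone accuracy of all word strings passing through arc q
    return fwd + acc + bwd

In MPE training, comparing these per-arc averages with the utterance's overall average accuracy determines whether an arc contributes numerator-like or denominator-like statistics.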
Keywords

minimum phone error

Parallel Abstract


Traditional speech recognition uses maximum likelihood estimation to train the parameters of hidden Markov models. Such training maximizes the likelihood of the correct transcription, but it cannot separate confusable models effectively. Discriminative training takes the correct transcription and the recognition results into consideration at the same time, trying to separate confusable models in a high-dimensional space. Based on minimum phone error (MPE) and feature-space minimum phone error (fMPE), this thesis introduces the background knowledge, basic theory, and experimental results of discriminative training. The thesis has four parts.

The first part covers the basic theory, including risk estimation and auxiliary functions. Risk estimation starts from the minimum Bayesian risk and introduces several widely explored model training methods: maximum likelihood estimation, maximum mutual information estimation, overall risk criterion estimation, and minimum phone error. Their objective functions can all be regarded as extensions of the Bayesian risk. In addition, the thesis reviews strong-sense and weak-sense auxiliary functions and the smoothing function. Strong-sense and weak-sense auxiliary functions can be used to find a local optimum iteratively; when a weak-sense auxiliary function is used, adding a smoothing function improves the convergence speed.

The second part describes the experimental architecture, including the NTNU broadcast news corpus, the lexicon, and the language model. The recognizer uses a left-to-right, frame-synchronous tree copy search to implement LVCSR. Maximum likelihood training results on mel-frequency cepstral coefficients and on features processed by heteroscedastic linear discriminant analysis serve as the baseline.

The third part presents minimum phone error training, which uses the expected phone accuracy directly as the objective function. The update equations show that the newly trained model parameters move closer to correctly recognized features (those belonging to the numerator lattices) and away from wrongly recognized features (those belonging to the denominator lattices). The I-smoothing technique introduces the model's prior to improve the estimation. The thesis also introduces the approximation of phone error: how to use a lattice to approximate all recognition results, and how to use the forward-backward algorithm to calculate the average accuracy. Experimental results show that this method reduces the character error rate by about 3% on the corpus.

The fourth part presents feature-space minimum phone error training. The method projects features into a high-dimensional space, generates an offset vector, and adds it to the original feature to achieve discrimination. The transformation matrix is trained by differentiating the MPE objective function with respect to the features and updating with gradient descent. The differential has a direct part and an indirect part; the indirect differential reflects model changes on the features, so feature training and model training can be done alternately. Offset fMPE differs in how the high-dimensional vector is computed, and achieves a similar improvement with about a quarter of the computation. This thesis also proposes dimension-weighted offset fMPE, which weights different dimensions of the high-dimensional vector according to their importance. Experimental results show that these methods achieve about a 3% character error rate reduction; dimension-weighted offset fMPE yields a further improvement of about 0.4% and is more robust during training.
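As a concrete illustration of the feature transform described in the fourth part, here is a minimal sketch rather than the thesis's implementation: the shapes, the 512-Gaussian offset model, and the use of plain Gaussian posteriors as the high-dimensional vector h_t are assumptions, and the gradient-descent training of the projection matrix M is omitted.

import numpy as np

# Minimal fMPE-style transform: y_t = x_t + M @ h_t, where h_t is the vector
# of Gaussian posteriors evaluated at frame x_t (the "high-dimensional space")
# and M projects it back down to the feature dimension as an offset.
# In fMPE proper, M is trained by gradient descent on the MPE objective.

def gaussian_posteriors(x, means, variances, weights):
    """Posterior of each diagonal-covariance Gaussian at frame x (h_t)."""
    log_lik = -0.5 * np.sum((x - means) ** 2 / variances
                            + np.log(2.0 * np.pi * variances), axis=1)
    log_lik += np.log(weights)
    log_lik -= log_lik.max()              # numerical stability
    post = np.exp(log_lik)
    return post / post.sum()

def fmpe_transform(x, M, means, variances, weights):
    """Discriminatively offset feature y_t = x_t + M h_t."""
    return x + M @ gaussian_posteriors(x, means, variances, weights)

# Hypothetical usage: 39-dim MFCC frames, a 512-Gaussian offset model.
d, n = 39, 512
rng = np.random.default_rng(0)
means = rng.normal(size=(n, d))
variances = np.ones((n, d))
weights = np.full(n, 1.0 / n)
M = np.zeros((d, n))                      # zero at start: y_t equals x_t
y = fmpe_transform(rng.normal(size=d), M, means, variances, weights)

As the abstract notes, the offset and dimension-weighted offset variants change only how the high-dimensional vector is constructed, while keeping the additive structure y_t = x_t + M h_t.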

Parallel Keywords

minimum phone error


Cited by


程永任 (2008). Evaluation and Analysis of Minimum Phone Error Training and Its Improved Methods for Large Vocabulary Mandarin Speech Recognition [Master's thesis, National Taiwan University]. Airiti Library. https://doi.org/10.6342/NTU.2008.02662
