Evaluation and Analysis of Minimum Phone Error Training and Its Modifications for Mandarin Large-Vocabulary Speech Recognition

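Conventional acoustic model training is based on Maximum Likelihood (ML) estimation. Although ML gives the correct transcription the largest posterior probability on the training corpus, it cannot guarantee that incorrect acoustic features will not yield an even larger posterior probability. Discriminative training takes both the competing recognition hypotheses and the correct transcription into account during training, and tries to prevent incorrect acoustic features from obtaining a higher posterior probability than the correct transcription. This thesis centers on Minimum Phone Error (MPE) training and its modifications, and presents the background, theoretical foundations, and experimental results of discriminative training in detail. The thesis has five parts. The first part covers the basic theory of discriminative training. Starting from Bayes risk, it introduces several widely studied model training methods: ML estimation, Maximum Mutual Information (MMI) estimation, Overall Risk Criterion Estimation (ORCE), Minimum Classification Error (MCE) training, and MPE training. The objective functions of all of these methods can be viewed as extensions of Bayes risk. The second part describes the experimental setup: the broadcast news corpus from National Taiwan Normal University; the front-end processing, which uses Mel-Frequency Cepstrum Coefficients (MFCC); the initial acoustic models, trained with HTK using ML estimation; the lexicon and language model, built with SRILM from text collected by the Central News Agency; and the speech recognition tool, TTK, from the speech laboratory of National Taiwan University. The baseline experiment is the recognition result of the initial acoustic models. The third part covers MPE training. It first presents the theoretical derivation for optimizing the objective function, which yields the update formulas for the model parameters. It then describes how the statistics in those update formulas are computed in practice, including the definition of accuracy and the algorithms for word arc accuracy and the expected accuracy of a word graph. In the experiments, MPE training improves character accuracy by about 2.4%. The fourth part introduces modifications of MPE training: Minimum Phone Frame Error (MPFE) training, physical state level Minimum Bayes Risk (sMBR) training, and Minimum Divergence (MD) training. These methods differ mainly in how accuracy is defined in the objective function. Among the four methods, MPE included, the three other than MD all improve both word accuracy and character accuracy; MPE gives the best character accuracy, and MPFE gives the best word accuracy. This thesis also refines the definition of accuracy in the objective function further by adding an error penalty and phone length normalization. In the experiments, this refined accuracy improves character accuracy but lowers word accuracy. The fifth part introduces a data selection method based on the expected accuracy of word arcs, which aims to select the more discriminative word arcs for training. It is tested on MPE training and on one modified version of MPFE training. The results show that data selection has little effect on accuracy but speeds up the convergence of training.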

Keywords

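Mandarin large-vocabulary speech recognition; discriminative training; minimum phone error training; minimum phone frame error; minimum Bayes risk; minimum divergence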

Parallel Abstract

Traditional acoustic model training is based on Maximum Likelihood (ML). This training method maximizes the posterior probability of the transcription in the training corpus, but cannot guarantee that incorrect observations do not obtain a larger posterior probability. Discriminative training considers the possible recognition hypotheses and the transcription at the same time, and tries to keep incorrect observations from obtaining a larger posterior probability than correct ones. This thesis focuses on Minimum Phone Error (MPE) and its modified versions, and details the background, theory, and experimental results of discriminative training. The thesis is divided into five parts. The first part is the background and theory of discriminative training. It starts with Bayes risk and then introduces several popular model training methods: Maximum Likelihood, Maximum Mutual Information (MMI) estimation, Overall Risk Criterion Estimation (ORCE), Minimum Classification Error (MCE) training, and Minimum Phone Error (MPE) training. The objective functions of these training methods can all be considered extensions of Bayes risk. The second part is the experimental framework of this thesis: the Taiwan broadcast news corpus from National Taiwan Normal University; the front-end processing of the corpus, Mel-Frequency Cepstrum Coefficients (MFCC); the initial acoustic models, trained with HTK under Maximum Likelihood; the lexicon and language model, trained with SRILM on a text corpus collected by the Central News Agency; and the speech recognition decoder, TTK, from the speech lab of National Taiwan University. The experimental baseline is the recognition result obtained by decoding with the initial acoustic models. The third part is Minimum Phone Error. First, the theory and the optimization of the objective function are derived to obtain the update formulas for the model parameters. Then, the practical computation of the statistics in those update formulas is described, including the definition of accuracy, the word arc accuracy, and the expected accuracy of the word graph. In the experiments, Minimum Phone Error gives about a 2.4% improvement in character accuracy. The fourth part introduces the modifications of Minimum Phone Error: Minimum Phone Frame Error (MPFE) training, physical state level Minimum Bayes Risk (sMBR) training, and Minimum Divergence (MD) training. The main difference between these methods is the definition of accuracy in the objective function. The experimental results for the four methods, including MPE, show that all of them except Minimum Divergence improve word and character accuracy. MPE has the best character accuracy, and MPFE has the best word accuracy. In addition, this thesis further modifies the definition of accuracy in the objective function by adding an error penalty and phone length normalization. The experimental results show that these modifications improve character accuracy but degrade word accuracy. The fifth part introduces data selection based on the expected accuracy of word arcs. The goal is to select the more discriminative word arcs for training. This method is tested on MPE and on one modified version of MPFE. The experimental results show that data selection has no great effect on accuracy, but can speed up the convergence of training.
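
The role of the accuracy definition in the objective function can be illustrated with a minimal sketch. It is not the implementation used in this thesis: it replaces the word graph with a short list of hypotheses, applies the scaling factor `kappa` to a single combined log score per hypothesis, and uses the commonly cited approximate MPE phone accuracy (for a hypothesis phone, the maximum over reference phones of -1 + 2e for a matching label and -1 + e otherwise, where e is the fraction of the reference phone that is overlapped). All function names, the `Phone` tuple layout, and the toy data are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical phone segment: (label, start_frame, end_frame), end exclusive.
Phone = Tuple[str, int, int]


def overlap_fraction(q: Phone, z: Phone) -> float:
    """Fraction of reference phone z's duration covered by hypothesis phone q."""
    overlap = max(0, min(q[2], z[2]) - max(q[1], z[1]))
    return overlap / (z[2] - z[1])


def mpe_phone_accuracy(q: Phone, reference: Sequence[Phone]) -> float:
    """Approximate accuracy of one hypothesis phone, in the range [-1, 1]."""
    best = -1.0
    for z in reference:
        e = overlap_fraction(q, z)
        # Matching label: -1 + 2e; different label: -1 + e.
        score = -1.0 + 2.0 * e if q[0] == z[0] else -1.0 + e
        best = max(best, score)
    return best


def hypothesis_accuracy(hyp: Sequence[Phone], reference: Sequence[Phone]) -> float:
    """Accuracy of a hypothesis: the sum of its phone accuracies."""
    return sum(mpe_phone_accuracy(q, reference) for q in hyp)


def posteriors(log_scores: Sequence[float], kappa: float = 1.0) -> List[float]:
    """Normalize scaled log scores into hypothesis posteriors (softmax)."""
    scaled = [kappa * s for s in log_scores]
    m = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def expected_accuracy(
    hyps: Sequence[Sequence[Phone]],
    log_scores: Sequence[float],
    reference: Sequence[Phone],
    kappa: float = 1.0,
) -> float:
    """Posterior-weighted average accuracy over the hypothesis list."""
    post = posteriors(log_scores, kappa)
    return sum(p * hypothesis_accuracy(h, reference) for p, h in zip(post, hyps))


# Toy data: the first hypothesis equals the reference, the second has one wrong phone.
reference = [("a", 0, 10), ("b", 10, 20)]
hyps = [reference, [("a", 0, 10), ("c", 10, 20)]]
log_scores = [math.log(3.0), 0.0]  # posteriors 0.75 and 0.25 when kappa = 1
value = expected_accuracy(hyps, log_scores, reference)
```

In this toy case the two hypotheses have accuracies 2 and 1, so `value` is approximately 1.75. Swapping `mpe_phone_accuracy` for a frame-level or state-level measure is where, under the assumptions above, variants such as MPFE or sMBR would differ.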

Parallel Keywords

large vocabulary Mandarin speech recognition; discriminative training; minimum phone error; minimum phone frame error; minimum Bayes risk; minimum divergence
