A Study on Improving Discriminative Acoustic Model Training for Mandarin Continuous Speech Recognition

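This thesis investigates improved discriminative training of acoustic models for Mandarin large vocabulary continuous speech recognition. First, it proposes a new time-frame-level phone accuracy function to replace the raw phone accuracy function of minimum phone error (MPE) training; the new function can, to some extent, penalize deletion errors adequately. Second, it proposes a new data selection method based on time-frame-level normalized entropy to improve discriminative training, where the normalized entropy is computed from the posterior probabilities of the Gaussian distributions in the word lattices generated from the training data. This data selection method lets discriminative training concentrate on the statistics collected from training samples that lie close to the decision boundary, which yields better discriminative power. The method is further applied to unsupervised discriminative training of acoustic models. Finally, the thesis also modifies the objective function of discriminative training so that different statistics are collected to improve MPE training. The experiments use the Public Television Service (公視) broadcast news corpus. Preliminary results show that combining time-frame-level data selection with the new phone accuracy function gives slight but consistent improvements in the first few training iterations.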
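The abstract does not give the formula for the frame-level phone accuracy function, so the following is only a minimal sketch of one plausible reading: the accuracy of a hypothesized phone arc is taken to be the fraction of its frames whose reference phone label matches the arc's phone. The function name `frame_level_accuracy`, the half-open frame span, and the per-frame reference labeling are all assumptions made for illustration; the thesis's actual definition may differ.

```python
def frame_level_accuracy(arc_phone, arc_start, arc_end, reference_labels):
    """Fraction of frames in [arc_start, arc_end) whose reference label equals arc_phone.

    reference_labels is a hypothetical per-frame phone labeling of the
    reference transcription (e.g. from a forced alignment).
    """
    if not 0 <= arc_start < arc_end <= len(reference_labels):
        raise ValueError("arc span must lie inside the utterance and be non-empty")
    matches = sum(
        1 for t in range(arc_start, arc_end) if reference_labels[t] == arc_phone
    )
    return matches / (arc_end - arc_start)


# Toy utterance of six frames with reference phones a, a, a, b, b, sil.
reference = ["a", "a", "a", "b", "b", "sil"]
score = frame_level_accuracy("a", 0, 4, reference)  # arc "a" covers frames 0..3
```

In this toy case, three of the four frames covered by the hypothesized arc carry the reference label `a`, so the arc is credited in proportion to its frame-level agreement with the reference rather than scored against a single best-matching reference phone.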

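Keywords

Discriminative acoustic model training; large vocabulary continuous speech recognition; time-frame accuracy function; data selection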

Parallel Abstract

This thesis investigates improved discriminative training of acoustic models for Mandarin large vocabulary continuous speech recognition (LVCSR). First, we present a new phone accuracy function based on the frame-level accuracy of hypothesized phone arcs, which replaces the raw phone accuracy function of minimum phone error (MPE) training and can, to some extent, penalize deletion errors adequately. Second, we explore a novel data selection approach for discriminative training, based on the normalized frame-level entropy of the Gaussian posterior probabilities obtained from the word lattices of the training utterances. Its merit is that the training algorithm focuses more on the statistics of frame samples that lie close to the decision boundary, which improves discrimination. The proposed data selection approach is further applied to unsupervised discriminative training of acoustic models. Finally, we investigate a few other modifications of the training objective functions, as well as of the lattice structures, for accumulating MPE training statistics. Experiments on the Mandarin broadcast news corpus (MATBN) collected in Taiwan show that integrating the frame-level data selection with the new phone accuracy function achieves slight but consistent improvements over conventional MPE training in the early training iterations.
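The entropy-based data selection can be illustrated with a small sketch. It assumes each frame comes with a list of posterior probabilities over competing Gaussians (or arcs) taken from the lattice, divides the entropy by its maximum value log K so the result lies in [0, 1], and keeps frames whose normalized entropy reaches a threshold. The function names, the threshold value, and the "keep if at or above threshold" rule are hypothetical; the abstract does not state the exact selection criterion used in the thesis.

```python
import math


def normalized_entropy(posteriors):
    """Entropy of a posterior distribution divided by its maximum, log(K)."""
    k = len(posteriors)
    if k < 2:
        return 0.0  # a single candidate carries no uncertainty
    total = sum(posteriors)
    if total <= 0.0:
        raise ValueError("posteriors must have a positive sum")
    probs = [p / total for p in posteriors]  # renormalize defensively
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(k)


def select_frames(frame_posteriors, threshold=0.5):
    """Indices of frames whose normalized entropy is at least `threshold`.

    The threshold is an illustrative assumption, not a value from the thesis.
    """
    return [
        t
        for t, posteriors in enumerate(frame_posteriors)
        if normalized_entropy(posteriors) >= threshold
    ]


# Three toy frames: unambiguous, maximally ambiguous, and mildly ambiguous.
frames = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
selected = select_frames(frames, threshold=0.5)
```

A frame whose posterior mass sits on one Gaussian has normalized entropy near 0 and is far from the decision boundary, while a frame with evenly split posteriors has normalized entropy near 1. Accumulating training statistics only from the selected frames is one way to realize the focus on boundary samples that the abstract describes.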

Parallel Keywords

Discriminative training; large vocabulary continuous speech recognition; time-frame accuracy function; data selection

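Cited By

朱芳輝 (2007). 資料選取方法於鑑別式聲學模型訓練之研究 [A study of data selection methods for discriminative acoustic model training] (Master's thesis, National Taiwan Normal University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0021-0204200815535282

李鴻欣 (2009). 基於分類錯誤之線性鑑別式特徵轉換應用於大詞彙連續語音辨識 [Classification-error-based linear discriminant feature transformation applied to large vocabulary continuous speech recognition] (Master's thesis, National Taiwan Normal University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0021-1610201315172539

羅永典 (2010). 使用邊際資訊於鑑別式聲學模型訓練 [Using margin information in discriminative acoustic model training] (Master's thesis, National Taiwan Normal University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0021-1610201315213941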
