
探討聲學模型化技術與半監督鑑別式訓練於語音辨識之研究

Investigating Acoustic Modeling and Semi-supervised Discriminative Training for Speech Recognition

Advisor: 陳柏琳

Abstract


In recent years, the discriminative training objective function lattice-free maximum mutual information (LF-MMI) has achieved a major breakthrough in acoustic model training for automatic speech recognition (ASR). Although LF-MMI attains the best results in supervised settings, research on its use in semi-supervised settings remains limited. In self-training, a common semi-supervised approach, the seed model often performs poorly because the transcribed corpus is limited. Moreover, since LF-MMI is a discriminative training criterion, it is particularly sensitive to the correctness of the labels. Accordingly, this thesis decomposes semi-supervised training into two problems: 1) how to improve the performance of the seed model, and 2) how to exploit untranscribed (manually unlabeled) data. For the first problem, we employ two methods, corresponding to whether extra data is available: transfer learning, realized by weight transfer and multitask learning; and model combination, realized by hypothesis-level combination and frame-level combination. For the second problem, building on the LF-MMI objective function, we introduce negative conditional entropy (NCE) and lattices for supervision, which preserve a richer hypothesis space. A series of experiments on the Augmented Multi-Party Interaction (AMI) meeting corpus show that both transfer learning with out-of-domain (OOD) data and model combination exploiting complementary diversity can improve the seed model, while NCE and lattice supervision can exploit the untranscribed data to improve the word error rate (WER) and the WER recovery rate (WRR).

Abstract (English)


Recently, a novel objective function for discriminative acoustic model training, namely lattice-free maximum mutual information (LF-MMI), has been proposed and has achieved new state-of-the-art results in automatic speech recognition (ASR). Although LF-MMI shows excellent performance on various ASR tasks in supervised training settings, its performance often degrades significantly in semi-supervised settings. In self-training, a common semi-supervised approach, the seed model is often weak because transcribed data is scarce; moreover, LF-MMI shares a common deficiency of discriminative training criteria, namely sensitivity to the accuracy of the transcripts of the training utterances. In view of the above, this thesis explores two questions regarding LF-MMI in a semi-supervised training setting: first, how to improve the seed model, and second, how to use untranscribed training data. For the former, we investigate transfer learning approaches (weight transfer and multitask learning) and model combination (hypothesis-level combination and frame-level combination); the distinction between these two methods is whether extra training data is used. For the latter, we introduce negative conditional entropy (NCE) and lattices for supervision, in conjunction with the LF-MMI objective function. A series of experiments were conducted on the Augmented Multi-Party Interaction (AMI) benchmark corpus. The results show that transfer learning using out-of-domain data (OOD) and model combination exploiting complementary diversity can effectively improve the performance of the seed model, while pairing NCE with lattice supervision improves the word error rate (WER) and the WER recovery rate (WRR).
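For context, the MMI criterion that LF-MMI optimizes, and the NCE extension for untranscribed data, can be sketched as follows. This is the standard textbook formulation rather than notation reproduced from the thesis itself:

```latex
% MMI: maximize the posterior of the reference transcript W_u
% given the acoustics X_u of each utterance u (kappa: acoustic scale).
\mathcal{F}_{\mathrm{MMI}}
  = \sum_{u} \log
    \frac{p(\mathbf{X}_u \mid W_u)^{\kappa}\, P(W_u)}
         {\sum_{W'} p(\mathbf{X}_u \mid W')^{\kappa}\, P(W')}

% NCE: for an untranscribed utterance there is no reference W_u,
% so hypotheses W are weighted by their own posteriors, giving the
% negative conditional entropy of word sequences given the acoustics.
\mathcal{F}_{\mathrm{NCE}}
  = \sum_{u} \sum_{W} P(W \mid \mathbf{X}_u)\, \log P(W \mid \mathbf{X}_u)
  = -\,H(W \mid \mathbf{X})
```

In the lattice-free variant, the denominator sum in the MMI objective is computed over a phone-level denominator graph rather than a word lattice; NCE reduces to MMI when the posterior mass concentrates on a single hypothesis, which is why it serves as a natural semi-supervised counterpart.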

