結合鑑別式訓練與模型合併於半監督式語音辨識之研究

Leveraging Discriminative Training and Model Combination for Semi-supervised Speech Recognition

Abstract


In recent years, Lattice-free Maximum Mutual Information (LF-MMI), a discriminative training criterion, has brought major breakthroughs in automatic speech recognition (ASR). Although LF-MMI achieves the best results in the supervised setting, in the semi-supervised setting the seed model often performs poorly because only limited labeled data are available; moreover, being a discriminative criterion, LF-MMI is easily affected by whether the training transcripts are correct. This paper pursues two lines of work for semi-supervised training. First, we introduce negative conditional entropy (NCE) weighting and word lattices. The former minimizes the conditional entropy over the lattice paths, which is equivalent to a weighted average of the MMI objective over the possible reference transcripts; the weighting folds naturally into MMI training while modeling the uncertainty of the transcripts, the aim being that the model can be trained even without a confidence-based filter. The latter supervises training with word lattices rather than only the one-best recognition results, preserving a larger hypothesis space and thus raising the chance of finding the true reference transcripts. Second, borrowing the idea of ensemble learning, we let weak learners correct one another's errors through hypothesis-level combination and frame-level combination. Experimental results show that both NCE weighting and lattice supervision reduce the word error rate (WER), that model combination markedly improves performance at every stage, and that combining the two yields a WER recovery rate (WRR) of 60.8%.
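
For concreteness, the two criteria referred to above can be written in the form standard in the LF-MMI literature; the notation below is chosen for illustration and is not necessarily that of the paper. For a labeled utterance with acoustics X_u and reference transcript W_u, the MMI objective is

    \mathcal{F}_{\mathrm{MMI}}(\theta) = \sum_{u} \log \frac{p_\theta(X_u \mid W_u)\, P(W_u)}{\sum_{W'} p_\theta(X_u \mid W')\, P(W')},

while for an unlabeled utterance the negative conditional entropy over the paths W of its decoding lattice \mathcal{L}_u,

    \mathcal{F}_{\mathrm{NCE}}(\theta) = -H(W \mid X_u) = \sum_{W \in \mathcal{L}_u} P_\theta(W \mid X_u) \log P_\theta(W \mid X_u),

is maximized. The latter amounts to a posterior-weighted average of MMI terms in which each lattice path serves as a pseudo-reference, so uncertain automatic transcripts are down-weighted rather than discarded by a confidence filter.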

Parallel Abstract


In recent years, the so-called Lattice-free Maximum Mutual Information (LF-MMI) criterion has been proposed and applied with good success to supervised training of state-of-the-art acoustic models in various automatic speech recognition (ASR) applications. However, when moving to the scenario of semi-supervised acoustic model training, the seed models used by LF-MMI often show inadequate competence due to the limited amount of available manually labeled training data. In addition, LF-MMI shares a common deficiency of discriminative training criteria: it is sensitive to the accuracy of the transcripts of the training utterances. This paper sets out to explore two novel extensions of semi-supervised training in conjunction with LF-MMI. First, we capitalize on negative conditional entropy (NCE) weighting and utilize word lattices for supervision in the semi-supervised setting. The former aims to minimize the conditional entropy of a lattice, which is equivalent to taking a weighted average over all possible reference transcripts; minimizing the lattice entropy is thus a natural extension of the MMI objective that also models uncertainty. The latter, using word lattices instead of one-best results for supervision, preserves more cues in the hypothesis space and thereby increases the possibility of finding the true reference transcripts of the training utterances. Second, we draw on the notion of ensemble learning to develop two disparate combination methods, namely hypothesis-level combination and frame-level combination, so as to enhance the error-correcting capability of the acoustic models. The experimental results on a meeting transcription task show that NCE weighting, as well as word-lattice supervision, can significantly reduce the word error rate (WER) of the ASR system, while the model combination approaches considerably improve performance at various stages. Finally, fusing the two kinds of extensions achieves a WER recovery rate (WRR) of 60.8%.
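
As a rough illustration of the frame-level combination step described above (a minimal sketch under our own assumptions, not the paper's implementation; the function and variable names are hypothetical), the per-frame senone posteriors produced by several acoustic models can be interpolated before decoding:

    import numpy as np

    def frame_level_combine(posteriors, weights=None):
        """Interpolate per-frame senone posteriors from several acoustic models.

        posteriors : list of (T, S) NumPy arrays, one per model, aligned frame by frame.
        weights    : optional per-model interpolation weights; defaults to uniform.
        Returns a single (T, S) matrix that can be handed to the decoder.
        """
        if weights is None:
            weights = [1.0 / len(posteriors)] * len(posteriors)
        combined = sum(w * p for w, p in zip(weights, posteriors))
        return combined / combined.sum(axis=1, keepdims=True)  # renormalize each frame

Hypothesis-level combination, by contrast, typically operates on the decoded outputs (e.g., by voting over the competing hypotheses) rather than on frame posteriors. The WER recovery rate quoted at the end is conventionally computed as WRR = (WER_seed − WER_semi) / (WER_seed − WER_oracle), where the seed system is trained on the labeled data only and the oracle system on all data with their true transcripts.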

