
Leveraging Discriminative Training and Improved Neural Network Architecture and Optimization Method

Abstract


This thesis investigates the impact of improved acoustic modeling on Mandarin large-vocabulary continuous speech recognition. For training the baseline acoustic models, instead of the cross-entropy objective conventionally used for deep neural networks in speech recognition, we adopt lattice-free maximum mutual information (LF-MMI) as the objective for sequence discriminative training. LF-MMI enables fast forward-backward computation on a graphics processing unit (GPU) to obtain the posterior probabilities over all possible paths, eliminating the word-lattice generation step required before conventional discriminative training. Under this training scheme, a time-delay neural network (TDNN) is commonly used as the acoustic model and achieves good recognition performance. This thesis therefore deepens the TDNN-based model and stabilizes the training of the deeper network through semi-orthogonal low-rank matrix factorization. In addition, to improve the model's generalization ability, we apply the backstitch optimization method. Experiments on a Mandarin broadcast news transcription task show that combining these two improvements yields a significant reduction in character error rate (CER) for the TDNN-LF-MMI model.
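The semi-orthogonal low-rank factorization mentioned above can be illustrated with a small sketch. This is a simplified version of the periodic constraint update used for factorized TDNN layers; the matrix sizes, scale, and iteration count below are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def semi_orthogonal_step(M):
    """One update nudging M toward semi-orthogonality (M @ M.T ~= I).

    The iteration M <- M - 0.5 * (M M^T - I) M is applied periodically
    during training to keep the low-rank projection factor of a
    factorized TDNN layer semi-orthogonal, which stabilizes training
    of deeper networks.
    """
    P = M @ M.T
    I = np.eye(M.shape[0])
    return M - 0.5 * (P - I) @ M

rng = np.random.default_rng(0)
# A "wide" factor, e.g. projecting a 256-dim hidden layer down to 64 dims.
M = 0.05 * rng.standard_normal((64, 256))

# Iterate until convergence to show the fixed point; in training this
# update is interleaved with gradient steps rather than run to convergence.
for _ in range(30):
    M = semi_orthogonal_step(M)

print(np.max(np.abs(M @ M.T - np.eye(64))))  # close to 0
```

Note that the iteration only converges when the factor's singular values are not too large, which is why the sketch initializes `M` with a small scale.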

Parallel Abstract


This paper sets out to investigate the effect of acoustic modeling on Mandarin large vocabulary continuous speech recognition (LVCSR). In order to obtain more discriminative baseline acoustic models, we adopt the recently proposed lattice-free maximum mutual information (LF-MMI) criterion as the objective for sequence training of the component neural networks, in place of the conventional cross-entropy criterion. LF-MMI brings the benefit of efficient forward-backward statistics accumulation on the graphics processing unit (GPU) over all hypothesized word sequences, without the need for an explicit word-lattice generation step. Paired with LF-MMI, acoustic models implemented with the so-called time-delay neural network (TDNN) often deliver impressive performance. In view of the above, we explore an integration of two novel extensions of acoustic modeling. One is to conduct semi-orthogonal low-rank matrix factorization on TDNN-based acoustic models with deeper network layers to increase their robustness. The other is to integrate the backstitch mechanism into the update process of the acoustic models to promote generalization. Extensive experiments carried out on a Mandarin broadcast news transcription task reveal that the integration of these two extensions yields considerable improvements over the baseline LF-MMI models in terms of character error rate (CER) reduction.
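The backstitch mechanism described above can be sketched as a two-part SGD update. This is a toy illustration on a quadratic loss; the learning rate, backstitch scale, and loss function here are assumptions for demonstration, not the settings used in the thesis:

```python
import numpy as np

def backstitch_step(theta, grad_fn, alpha=0.1, nu=0.3):
    """One backstitch SGD update.

    First take a small step *up* the loss surface (scaled by nu), then
    take a larger descent step from that perturbed point. The intent is
    to counteract overfitting to each individual minibatch and improve
    generalization.
    """
    theta1 = theta + nu * alpha * grad_fn(theta)          # backward "stitch" step
    theta2 = theta1 - (1 + nu) * alpha * grad_fn(theta1)  # larger forward step
    return theta2

# Toy example: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
grad = lambda theta: theta
theta = np.array([2.0, -1.0])
for _ in range(50):
    theta = backstitch_step(theta, grad)

print(np.linalg.norm(theta))  # shrinks toward 0
```

In practice the two gradient evaluations per update roughly double the cost of each minibatch, which is the main trade-off of the method.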

