
使用基於發音方式與位置的多任務學習來改進華語大詞彙語音辨識

Improving Mandarin LVCSR Using Place and Manner Based Multi-task Learning

Advisor: Jyh-Shing Roger Jang (張智星)

Abstract


In large-vocabulary speech recognition, replacing GMM-HMM acoustic models with DNN-HMM has yielded significant improvements. This thesis uses a multi-task learning neural network (MTL-DNN): in addition to the primary senone classification task, we simultaneously train the DNN on articulatory attributes of place and manner of articulation as subtasks, which improves recognition results. Compared with previous work, we propose three improvements. First, we divide the articulatory-attribute labels into four blocks, with the attributes within each block mutually exclusive, and use these blocks as the subtask output layers when training the MTL-TDNN model, replacing the conventional multi-label approach. Second, we replace the conventional feed-forward neural network with a time-delay neural network (TDNN), whose structure incorporates more contextual information into training. Third, we connect the subtask output layers to lower hidden layers. The experiments use the Mandarin Chinese broadcast news corpus (MATBN), split into a small dataset (MATBN-20) and a large dataset (MATBN-200), and are evaluated by character error rate (CER). Compared with a conventional single-task TDNN model, the best models achieve relative improvements of 3.33% and 1% on MATBN-20 and MATBN-200, respectively.
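
To make the described architecture concrete, the following is a minimal sketch in PyTorch of an MTL-TDNN along the lines of the abstract: dilated 1-D convolutions form the shared TDNN trunk, the senone head sits on the top layer, and one softmax head per mutually exclusive attribute block branches off a lower hidden layer. All layer widths, context widths, and block sizes are invented for illustration and are not the thesis configuration.

```python
import torch
import torch.nn as nn

class MTLTDNN(nn.Module):
    """Sketch of an MTL-TDNN: a shared trunk of dilated 1-D convolutions
    over time, a primary senone head on the top layer, and one softmax head
    per mutually exclusive articulatory-attribute block attached to a lower
    hidden layer. Sizes are illustrative only."""
    def __init__(self, feat_dim=40, num_senones=3000,
                 attr_blocks=(8, 6, 5, 4)):  # hypothetical: 4 attribute blocks
        super().__init__()
        # Lower trunk: shared by all tasks; subtask heads branch off here.
        self.lower = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=3, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
        )
        # Upper trunk: only the senone task passes through these layers.
        self.upper = nn.Sequential(
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
        )
        self.senone_head = nn.Conv1d(512, num_senones, kernel_size=1)
        # One independent softmax head per attribute block, so labels are
        # mutually exclusive within a block rather than one multi-label layer.
        self.attr_heads = nn.ModuleList(
            nn.Conv1d(512, n, kernel_size=1) for n in attr_blocks)

    def forward(self, x):  # x: (batch, feat_dim, time)
        h_low = self.lower(x)
        senone_logits = self.senone_head(self.upper(h_low))
        attr_logits = [head(h_low) for head in self.attr_heads]
        return senone_logits, attr_logits
```

Training such a model would minimize the senone cross-entropy plus a weighted sum of one cross-entropy per attribute block; the subtask weights and the exact layer at which the heads branch off are hyperparameters the abstract does not specify.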

Parallel Abstract


In large vocabulary continuous speech recognition (LVCSR), it is well known that recognition performance improves when DNN-HMM acoustic models replace GMM-HMM. In this thesis, we use a multi-task learning model (MTL-DNN) that simultaneously minimizes the cross-entropy losses with respect to the output scores of senones and of articulatory attributes such as place and manner. The proposed framework has three novelties compared with previous studies. First, the subtasks designed for articulation classification partition the attribute labels into blocks of mutually exclusive attributes, replacing the conventional multi-label output scheme. Second, instead of fully connected multilayer perceptrons, the well-known time-delay neural network (TDNN) structure is adopted to model long temporal contexts efficiently. Finally, in the proposed MTL-TDNN architecture, layer-wise neuron sharing between subtasks occurs only in the first few layers. We performed experiments on the Mandarin Chinese broadcast news corpus (MATBN), including a small dataset (MATBN-20) and a large dataset (MATBN-200). Compared with the conventional single-task learning TDNN model, the experiments show that the proposed framework achieves relative character error rate (CER) reductions of 3.3% and 1% on the small and large datasets, respectively.
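
Since both abstracts report relative rather than absolute CER reductions, the distinction is worth spelling out: a relative reduction divides the error-rate drop by the baseline CER. A minimal check, with hypothetical numbers that are not results from the thesis:

```python
def relative_cer_reduction(cer_baseline: float, cer_proposed: float) -> float:
    """Relative reduction = (baseline - proposed) / baseline."""
    return (cer_baseline - cer_proposed) / cer_baseline

# Hypothetical illustration only: a baseline CER of 30.0% falling to 29.0%
# is a 1-point absolute drop but a 3.33% relative reduction.
print(f"{relative_cer_reduction(0.30, 0.29):.2%}")  # 3.33%
```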
