

An Investigation of Hybrid CTC-Attention Modeling in Mandarin Speech Recognition


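In recent years, the emergence of end-to-end speech recognition has simplified many of the complicated procedures of conventional speech recognition. The two main model architectures in end-to-end speech recognition are connectionist temporal classification (CTC) and the attention model. This paper combines these two architectures (that is, the CTC-Attention hybrid model) for Mandarin meeting speech recognition, with the aim of further improving recognition performance. To this end, we analyze the effect of adjusting the combination weight used when the two models are merged, and further investigate how well the CTC-Attention hybrid model recognizes short utterances. Experimental results on a Chinese meeting corpus show that, compared with the TDNN-LFMMI model of conventional speech recognition, the CTC-Attention hybrid model can generalize better when utterances are short.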


The recent emergence of end-to-end automatic speech recognition (ASR) frameworks has streamlined the complicated modeling procedures required by conventional deep neural network-hidden Markov model (DNN-HMM) ASR systems. Two of the most popular end-to-end ASR approaches are connectionist temporal classification (CTC) and the attention-based encoder-decoder model (Attention Model). In this paper, we explore the utility of combining CTC and the attention model in an attempt to yield better ASR performance. We also analyze the impact of the combination weight and the performance of the resulting CTC-Attention hybrid system on short utterances. Experiments on a Mandarin Chinese meeting corpus demonstrate that the CTC-Attention hybrid system recognizes short utterances better than the so-called TDNN-LFMMI system, one of the state-of-the-art DNN-HMM settings.
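As a rough illustration of what a combination weight does, the sketch below interpolates a CTC loss and an attention loss linearly. The abstract does not state the exact form of the combination, so the linear interpolation, the function names hybrid_loss and sweep_weights, the parameter name ctc_weight, and the numeric loss values are all assumptions made for illustration only.

```python
from typing import Dict, Iterable


def hybrid_loss(ctc_loss: float, att_loss: float, ctc_weight: float) -> float:
    """Linearly interpolate two per-utterance losses.

    Assumption: the hybrid objective is ctc_weight * L_ctc + (1 - ctc_weight) * L_att.
    This is a common way to combine the two, not necessarily the paper's exact form.
    """
    if not 0.0 <= ctc_weight <= 1.0:
        raise ValueError("ctc_weight must lie in [0, 1]")
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * att_loss


def sweep_weights(ctc_loss: float, att_loss: float,
                  weights: Iterable[float]) -> Dict[float, float]:
    """Evaluate the interpolated loss for several candidate weights."""
    return {w: hybrid_loss(ctc_loss, att_loss, w) for w in weights}


if __name__ == "__main__":
    # Hypothetical loss values, chosen only to show how the weight shifts
    # the objective between the attention term and the CTC term.
    for weight, loss in sweep_weights(2.0, 4.0, [0.0, 0.5, 1.0]).items():
        print(f"ctc_weight={weight:.1f} -> hybrid loss={loss:.2f}")
```

With ctc_weight set to 0 the objective reduces to the attention loss alone, and with 1 it reduces to the CTC loss alone. Intermediate values trade the two off, which is the kind of setting a weight analysis would vary.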


