
A study on Mandarin speech recognition using long short-term memory neural network

Advisor: 陳信宏

Abstract


In recent years, neural networks have been widely applied to speech recognition. This thesis uses the Kaldi speech recognition toolkit to implement recurrent neural network (RNN) acoustic models and to build a large-vocabulary Mandarin speech recognition system. Because RNNs have cyclic connections, they are better suited to modeling time-series signals than conventional fully connected deep neural networks. However, as training unrolls through time, simple RNNs suffer from vanishing and exploding gradients and therefore fail to capture long-term dependencies. The Long Short-Term Memory (LSTM) model addresses this problem; building on the LSTM architecture, this study combines convolutional neural networks (CNNs) and deep neural networks (DNNs) to construct a CLDNN model. TCC300 (24 hours), AIShell (162 hours), and NER (111 hours) were used as training corpora, and a language model was then added to build the large-vocabulary speech recognition system. To evaluate the robustness of the system, the test corpora were TCC300 (2.4 hours, read speech), NER-clean (1.9 hours, fast speech, no noise), and NER-other (9 hours, fast speech, with noise).
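The abstract attributes the LSTM's advantage over a simple RNN to its gated design: the cell state is updated additively through a forget gate, rather than being squashed multiplicatively at every step. As a minimal sketch of that mechanism only (not the thesis's actual CLDNN, whose layer sizes and weights are not given here), a single scalar LSTM time step in plain Python, with illustrative weights, looks as follows:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM time step for scalar input and state.

    w maps gate names to (input, recurrent, bias) weights;
    all values here are illustrative, not trained.
    """
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # additive cell update: gradients flow through f
    h = o * math.tanh(c)     # hidden state exposed to the next layer
    return h, c

# Run a short sequence; with the forget gate near 1, the cell state
# accumulates information instead of decaying multiplicatively.
w = {k: 0.5 for k in ("wf", "uf", "bf", "wi", "ui", "bi",
                      "wo", "uo", "bo", "wg", "ug", "bg")}
h, c = 0.0, 0.0
for x in [1.0, -0.5, 0.25]:
    h, c = lstm_step(x, h, c, w)
```

Because the backward pass through `c = f * c_prev + i * g` multiplies the gradient by the forget-gate activation `f` rather than by a full recurrent weight matrix, a gate saturated near 1 lets error signals survive over many time steps, which is the property the abstract refers to.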

Keywords

RNNs; LSTMs; gradient vanishing (exploding); acoustic model; Mandarin LVCSR; CNNs; DNNs

References


[1] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, K. Veselý, “The Kaldi speech recognition toolkit,” in IEEE ASRU, December 2011.
[2] “Mandarin Microphone Speech Corpus-TCC300,” [Online]. Available: http://www.aclclp.org.tw/use_mat_c.php#tcc300edu
[3] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AIShell-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in Proc. Oriental COCOSDA, 2017.
[4] L. R. Bahl, F. Jelinek, R. L. Mercer, “A maximum likelihood approach to continuous speech recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 5, pp. 179-190, Mar. 1983.
[5] S. Tibrewala, H. Hermansky, “Multi-band and adaptation approaches to robust speech recognition,” in Proceedings of European Conference on Speech Communication and Technology, 25(1-3), pp. 2619-2622, 1997.
