
NSYSU-MITLab Speech Recognition System for Formosa Speech Recognition Challenge 2020

Abstract


In this paper, we describe the system the NSYSU-MITLab team implemented for the Formosa Speech Recognition Challenge 2020 (FSR-2020). We build an end-to-end speech recognition system on the Transformer architecture, which is composed of multi-head attention, and combine it with Connectionist Temporal Classification (CTC) for joint end-to-end training and decoding. We also experiment with replacing the encoder with the Conformer architecture, which combines convolutional neural networks (CNN) with multi-head attention. In addition, we build a Deep Neural Network-Hidden Markov Model (DNN-HMM) system, in which the deep neural network part is constructed with Time-Restricted Self-Attention (TRSA) and the Factorized Time Delay Neural Network (TDNN-F). Our best results are a character error rate (CER) of 43.4% on the Taiwanese Han character (台文漢字) task and a syllable error rate (SER) of 25.4% on the Taiwanese Romanization (台羅拼音) task.
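The joint CTC/attention training mentioned above is commonly realized as a weighted interpolation of the two losses. A minimal sketch, assuming illustrative loss values and an interpolation weight lam that are not the team's actual configuration:

```python
def joint_ctc_attention_loss(ctc_loss: float, att_loss: float, lam: float = 0.3) -> float:
    """Interpolate the CTC and attention losses with weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * ctc_loss + (1.0 - lam) * att_loss

# Illustrative values only: with lam = 0.3, a CTC loss of 2.0 and an
# attention loss of 1.0 combine to 0.3 * 2.0 + 0.7 * 1.0, i.e. about 1.3.
print(joint_ctc_attention_loss(2.0, 1.0, 0.3))
```

A larger lam leans the training toward the monotonic alignments CTC prefers; a smaller lam leans toward the attention decoder.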

Parallel Abstract


In this paper, we describe the system the NSYSU-MITLab team implemented for the Formosa Speech Recognition Challenge 2020. We use the Transformer architecture composed of multi-head attention to construct an end-to-end speech recognition system and combine it with Connectionist Temporal Classification (CTC) for end-to-end training and decoding. We have also built a deep neural network combined with a hidden Markov model (DNN-HMM), using Time-Restricted Self-Attention (TRSA) and the Factorized Time Delay Neural Network (TDNN-F) for the deep neural network part. The best performance we have achieved with the proposed methods is a character error rate of 45.5% on the Taiwan Southern Min Recommended Characters (台文漢字) task and a syllable error rate of 25.4% on the Taiwan Minnanyu Luomazi Pinyin (台羅拼音) task.
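Both metrics reported above are edit-distance rates: CER counts character-level edits on the Han character transcripts, while SER counts edits over syllable tokens of the romanized transcripts. A minimal sketch of how such rates are computed with the standard Levenshtein distance (the example transcripts are hypothetical, and this is not the challenge's official scoring tool):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    m, n = len(ref), len(hyp)
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # i deletions
    for j in range(n + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[m][n]

def error_rate(ref, hyp):
    """CER when ref/hyp are character strings, SER when they are syllable lists."""
    return edit_distance(ref, hyp) / len(ref)

# CER over the characters of a hypothetical Taiwanese Han character transcript
# (one deleted character against a 7-character reference):
print(error_rate("今仔日天氣真好", "今日天氣真好"))
# SER over Tai-lo syllable tokens (one deleted syllable against 3 reference syllables):
print(error_rate("kin-á-ji̍t".split("-"), "kin-ji̍t".split("-")))
```

The same routine serves both tasks because Python strings and lists are both indexable sequences; only the tokenization differs.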

