End-to-End Speech Recognition Techniques for Computer-Assisted Pronunciation Training

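A computer-assisted pronunciation training (CAPT) system has two main tasks: mispronunciation detection and mispronunciation diagnosis. Past research on both tasks has relied mainly on the forced-alignment procedure of a traditional speech recognition system, using the phone segments produced by forced alignment, together with either all observed phones or only the more confusable ones, to compute goodness of pronunciation (GOP) scores. Meanwhile, because the end-to-end speech recognition framework simplifies the steps that traditional speech recognition requires, acoustic models trained under this framework have become a popular research topic in recent years; the two main approaches are connectionist temporal classification (CTC) and the attention model. Such architectures, however, have mainly been studied for how accurately they map speech features to text sequences, and phone-level recognition has received less attention. This thesis therefore uses end-to-end speech recognizers to investigate mispronunciation detection and diagnosis. Drawing on earlier work built on traditional acoustic models, it proposes three detection methods based on end-to-end acoustic models: (1) one based on the speech recognition results; (2) one based on the confidence scores produced during recognition; and (3) one that combines the confidence scores with the N-best recognition results. In addition, the methods based on the recognition results and on the combination of confidence scores with N-best results also perform mispronunciation diagnosis at the same time. Experiments show that using the recognition results directly for detection and diagnosis can outperform the earlier two-stage approach that computes GOP scores through forced alignment.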
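
For readers unfamiliar with the GOP score mentioned above, a commonly used formulation is sketched below. The abstract does not state the exact variant used in the thesis, so the symbols here are assumptions for illustration: O^{(p)} is the run of acoustic frames that forced alignment assigns to the canonical phone p, N_p is the number of those frames, and Q is the competing phone set (all phones, or only the more confusable ones).

```latex
\mathrm{GOP}(p) \;=\; \frac{1}{N_p}\,
  \left|\, \log \frac{P\!\left(O^{(p)} \mid p\right)}
                     {\max_{q \in Q} P\!\left(O^{(p)} \mid q\right)} \,\right|
```

Under this form, a phone would typically be flagged as mispronounced when its GOP value exceeds a chosen threshold, since a large value means some competing phone explains the segment much better than the canonical one.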

Keywords

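End-to-end speech recognition; Connectionist temporal classification; Attention model; Acoustic model; Mispronunciation detection; Mispronunciation diagnosis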

Parallel Abstract

The primary tasks of a computer-assisted pronunciation training (CAPT) system are mispronunciation detection and diagnosis. Previous research on CAPT mostly relies on a forced-alignment procedure, usually conducted with the acoustic models of a traditional speech recognition system, whose resulting phone segments are used to calculate goodness of pronunciation (GOP) scores for the phones of the spoken words with respect to a text prompt. On a separate front, the recently proposed end-to-end speech recognition architecture simplifies many of the training steps originally required for traditional speech recognition. As such, acoustic modeling based on this framework has become popular in recent years, with the connectionist temporal classification (CTC) model and the attention-based model as its two predominant instantiations. However, current work on such architectures is far more concerned with how correctly speech feature vectors are mapped to the corresponding text sequences than with phone-level discriminating capability for downstream applications such as CAPT. In view of this, this thesis sets out to conduct mispronunciation detection and diagnosis with end-to-end speech recognition. To this end, we design and develop three mispronunciation detection methods: 1) a method based simply on speech recognition results; 2) a method leveraging a recognition confidence measure; and 3) a method combining the recognition confidence measure with N-best recognition results. Notably, mispronunciation diagnosis can be achieved at the same time through the joint use of the recognition confidence measure and the N-best recognition results.
A series of experiments conducted on a Mandarin mispronunciation detection and diagnosis task demonstrates that our method, which jointly uses the recognition confidence measure and the N-best recognition results obtained from end-to-end speech recognition, can yield significantly better performance than a conventional two-stage method.
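
As an illustration of the first method (detection and diagnosis taken directly from recognition results), the following minimal sketch aligns the canonical phone sequence of a text prompt with the phone sequence output by a recognizer, using a standard edit-distance alignment. It is not the thesis's implementation: the function names `align` and `detect_and_diagnose`, the phone labels, and the choice of unit edit costs are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]

def align(canonical: List[str], recognized: List[str]) -> List[Pair]:
    """Edit-distance alignment of two phone sequences (unit costs, assumed)."""
    n, m = len(canonical), len(recognized)
    # cost[i][j] = edit distance between canonical[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1]
                + (canonical[i - 1] != recognized[j - 1])):
            pairs.append((canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))   # phone was dropped
            i -= 1
        else:
            pairs.append((None, recognized[j - 1]))  # extra phone was spoken
            j -= 1
    pairs.reverse()
    return pairs

def detect_and_diagnose(canonical: List[str],
                        recognized: List[str]) -> List[Tuple[Optional[str], Optional[str], str]]:
    """Label each aligned pair; the recognized phone serves as the diagnosis."""
    report = []
    for ref, hyp in align(canonical, recognized):
        if ref is None:
            label = "insertion"
        elif hyp is None:
            label = "deletion"
        elif ref == hyp:
            label = "correct"
        else:
            label = "substitution"
        report.append((ref, hyp, label))
    return report

if __name__ == "__main__":
    # Hypothetical example: the learner says "z" where "zh" is expected.
    for row in detect_and_diagnose(["zh", "ong", "g", "uo"],
                                   ["z", "ong", "g", "uo"]):
        print(row)
```

Any pair not labeled "correct" counts as a detected mispronunciation, and the recognized phone in that pair is the diagnosis. The confidence-based and N-best methods described in the abstract would need recognizer scores that this sketch does not model.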

Parallel Keywords

End-to-end speech recognition; Connectionist temporal classification; Attention model; Acoustic model; Mispronunciation detection; Mispronunciation diagnosis

