
端對端語音辨識技術於電腦輔助發音訓練

Computer-assisted Pronunciation Training Leveraging End-to-End Speech Recognition Techniques

Advisor: 陳柏琳

Abstract


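Computer-assisted pronunciation training (CAPT) comprises two main tasks: mispronunciation detection and mispronunciation diagnosis. Past research on both tasks has relied mainly on the forced-alignment procedure of a traditional speech recognition system, computing goodness of pronunciation (GOP) scores from the phone segments produced by forced alignment together with either all observed phones or the more confusable phones. Meanwhile, because the end-to-end speech recognition framework simplifies the steps required by traditional speech recognition, acoustic models trained under this framework have become a popular research topic in recent years; the two main approaches are connectionist temporal classification (CTC) and the attention model. However, such architectures have mainly been studied for the correctness of mapping speech features to text sequences, and phone-level recognition has received less attention. This thesis therefore uses end-to-end speech recognizers to investigate mispronunciation detection and diagnosis. Drawing on earlier work based on traditional acoustic models, it proposes three mispronunciation detection methods built on end-to-end acoustic models: (1) a method based on speech recognition results; (2) a method based on the confidence score produced during recognition; and (3) a method combining the confidence score with N-best recognition results. In addition, the first and third methods can perform mispronunciation diagnosis at the same time. Experiments show that performing detection and diagnosis directly on the speech recognition results can outperform the conventional two-stage detection method that computes GOP via forced alignment.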

Parallel Abstract


One of the primary tasks of a computer-assisted pronunciation training (CAPT) system is mispronunciation detection and diagnosis. Previous research on CAPT mostly relies on a forced-alignment procedure, usually conducted with the acoustic models of a traditional speech recognition system, to obtain phone segments from which goodness of pronunciation (GOP) scores are calculated for the phones of spoken words with respect to a text prompt. On a separate front, the recently proposed end-to-end speech recognition architecture simplifies many of the training steps originally required for traditional speech recognition. As such, acoustic modeling based on this framework has become popular in recent years, with two predominant instantiations: the connectionist temporal classification (CTC) model and the attention-based model. However, current exploration of such architectures is far more concerned with the correctness of mapping speech feature vectors to the corresponding text sequences than with their phone-level discriminating capability for downstream applications like CAPT. In view of this, this thesis sets out to conduct mispronunciation detection and diagnosis on the strength of end-to-end speech recognition. To this end, we design and develop three mispronunciation detection methods: 1) a method based simply on speech recognition results; 2) a method leveraging a recognition confidence measure; and 3) a method combining the recognition confidence measure with N-best recognition results. Notably, mispronunciation diagnosis can be achieved simultaneously through the joint use of the recognition confidence measure and the N-best recognition results.
A series of experiments is conducted on a Mandarin mispronunciation detection and diagnosis task, demonstrating that our method, which jointly uses the recognition confidence measure and the N-best recognition results obtained from end-to-end speech recognition, can yield significantly better performance than a conventional two-stage method.
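
As a rough illustration of the first method (detection based directly on recognition results), the sketch below aligns a canonical phone sequence from the text prompt with the phone sequence output by a recognizer, then labels each position. It is a minimal sketch under stated assumptions, not the thesis's implementation: the abstract does not describe how the comparison is performed, so the Levenshtein alignment, the function names `align` and `detect_and_diagnose`, and the phone labels are all hypothetical.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align(canonical: List[str], recognized: List[str]) -> List[Pair]:
    """Levenshtein-align two phone sequences (assumed alignment strategy).

    Returns (canonical_phone, recognized_phone) pairs; None marks a gap.
    """
    n, m = len(canonical), len(recognized)
    # cost[i][j] = edit distance between canonical[:i] and recognized[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = canonical[i - 1] != recognized[j - 1]
            cost[i][j] = min(
                cost[i - 1][j - 1] + mismatch,  # match or substitution
                cost[i - 1][j] + 1,             # canonical phone deleted
                cost[i][j - 1] + 1,             # extra phone inserted
            )
    # Walk back from the end of both sequences to recover the alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j]
                == cost[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])):
            pairs.append((canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))
            i -= 1
        else:
            pairs.append((None, recognized[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def detect_and_diagnose(canonical: List[str], recognized: List[str]):
    """Label each aligned position; the recognized phone serves as the diagnosis."""
    report = []
    for ref, hyp in align(canonical, recognized):
        if ref is None:
            report.append((None, "insertion", hyp))
        elif hyp is None:
            report.append((ref, "deletion", None))
        elif ref == hyp:
            report.append((ref, "correct", hyp))
        else:
            report.append((ref, "substitution", hyp))
    return report
```

For a prompt with canonical phones `["n", "i", "h", "ao"]` and a recognizer output of `["l", "i", "h", "ao"]`, the first position is flagged as a substitution and `l` is reported as the phone the learner appears to have produced, which is how detection and diagnosis can fall out of a single recognition pass.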

