
利用多任務學習模型建立發音特徵來改善華語錯誤發音偵測與診斷之回饋

Mandarin Mispronunciation Detection and Diagnosis Feedback Using Articulatory Attributes Based Multi-task Learning

Advisor: 張智星

Abstract


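This thesis studies computer-assisted pronunciation training (CAPT), focusing on mispronunciation detection and on feedback tied to a model of the oral cavity. We propose that adding speech attributes, such as place and manner of articulation, improves mispronunciation detection and yields more precise mispronunciation diagnosis. In practice, we use time-delay neural networks (TDNNs) with a multi-task learning strategy to train a discriminative articulatory model that outputs an articulatory score for each phone, and we train a TDNN-based acoustic model that outputs a phonetic score. At test time, the system detects mispronunciations from the articulatory and phonetic scores and returns precise feedback for improving pronunciation. The experiments use the MATBN corpus of Mandarin news broadcasts from the Public Television Service, and use equal error rate (EER) and diagnosis accuracy (DA) to show that the deep neural network-hidden Markov model (DNN-HMM) outperforms the Gaussian mixture model-hidden Markov model (GMM-HMM). The proposed method is also applicable to any language, although this thesis focuses on Mandarin.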

Parallel Abstract


This paper presents our research on computer-assisted pronunciation training (CAPT). We focus on mispronunciation detection and articulation feedback. We propose taking speech attributes, namely place and manner of articulation, into account in the assessment models to improve mispronunciation detection and return precise articulation feedback to learners. We train a discriminative articulatory model based on time-delay neural networks (TDNNs) with a multi-task learning strategy to give the articulatory score, and a TDNN-based acoustic model to give the phonetic score. In testing, the system detects mispronunciations and returns precise articulation feedback based on both the phonetic and articulatory scores. Experiments conducted on the MATBN Mandarin Chinese broadcast news corpus show that the proposed models outperform the Gaussian mixture model (GMM)-based and deep neural network (DNN)-based baselines in terms of equal error rate (EER) and diagnostic accuracy (DA). Furthermore, our mispronunciation detection system should work in any language, although the current system focuses on Mandarin.
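
For readers unfamiliar with the equal error rate (EER) used above, the following is a minimal sketch of how such a figure could be computed from per-phone scores. It is not the thesis's evaluation code. The function name `equal_error_rate`, the score convention (a higher score means the phone is more likely mispronounced), and the label convention (1 = mispronounced, 0 = correct) are all assumptions made for illustration. The sketch sweeps every observed score as a candidate threshold and reports the point where the false acceptance rate and the false rejection rate are closest.

```python
from typing import List, Tuple


def equal_error_rate(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (eer, threshold) for a mispronunciation detector.

    Assumed conventions (hypothetical, not taken from the thesis):
      - labels: 1 = mispronounced phone, 0 = correctly pronounced phone
      - a phone is flagged as mispronounced when score >= threshold
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n_mis = sum(labels)
    n_ok = len(labels) - n_mis
    if n_mis == 0 or n_ok == 0:
        raise ValueError("need at least one phone of each class")

    # Candidate thresholds: every observed score, plus +inf (flag nothing).
    candidates = sorted(set(scores)) + [float("inf")]
    best = None  # (gap, eer, threshold)
    for t in candidates:
        # False acceptance: a mispronounced phone that is not flagged.
        false_accepts = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        # False rejection: a correct phone that is flagged.
        false_rejects = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        far = false_accepts / n_mis
        frr = false_rejects / n_ok
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finite data the two rates may never be exactly equal,
            # so report their average at the closest point.
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    demo_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    demo_labels = [1, 1, 0, 1, 0, 0]
    eer, threshold = equal_error_rate(demo_scores, demo_labels)
    print(f"EER = {eer:.3f} at threshold {threshold}")
```

A lower EER means the detector separates correct and mispronounced phones better. The threshold sweep here is quadratic in the number of phones, which is fine for a small illustration but would be replaced by a sorted single pass on a full corpus.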

