A Study on Combining Prosodic and Acoustic Features for Mispronunciation Detection and Diagnosis

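This thesis investigates applying prosodic features, together with a multi-task deep neural network model, to mispronunciation detection and diagnosis (MDD). Computer assisted pronunciation training (CAPT) aims to use a computer to automatically point out the pronunciation problems of foreign-language learners; procedurally, it can be roughly divided into two stages: mispronunciation detection and mispronunciation diagnosis. This thesis mainly explores (1) how much combining prosodic features with acoustic features helps mispronunciation detection and diagnosis; (2) using a multi-task deep neural network model to address the imbalance between positive and negative examples in the data; and (3) combining likelihood-based scoring (GOP) with classification-based scoring to achieve better detection and diagnosis results. The experimental results show that acoustic features are more helpful for the mispronunciation detection task, whereas prosodic features are more beneficial for the mispronunciation diagnosis task.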

Keywords

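computer assisted pronunciation training; multi-task learning; automatic speech recognition; mispronunciation detection; mispronunciation diagnosis; prosodic features; deep neural networks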

Parallel Abstract

This thesis examines how a multi-task deep neural network model and prosodic features can assist mispronunciation detection and diagnosis (MDD). The purpose of computer assisted pronunciation training (CAPT) is to help second-language (L2) learners correct their mispronunciations automatically. CAPT can be divided into two stages: mispronunciation detection and mispronunciation diagnosis. This thesis focuses on three aspects. First, we explore the benefit of combining prosodic and acoustic features for the mispronunciation detection and diagnosis tasks. Second, we use multi-task learning models to alleviate the imbalance between positive and negative examples in the data. Third, we combine likelihood-based scoring (GOP) with classification-based scoring to achieve better detection and diagnosis results. The experimental results show that acoustic features are more helpful for detecting mispronunciations, whereas prosodic features are more helpful for the mispronunciation diagnosis task.
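As a rough illustration of the likelihood-based scoring (GOP) mentioned above, the following sketch computes a frame-averaged log-likelihood ratio between the canonical phone and the best-scoring competing phone, then flags the segment when the score falls below a threshold. This is a common frame-level simplification, not necessarily the formulation used in this thesis; the function names, the dictionary input format, and the threshold value are all assumptions made for illustration.

```python
import math


def goodness_of_pronunciation(frame_log_likelihoods, canonical_phone):
    """Frame-averaged GOP score for one phone segment (illustrative sketch).

    frame_log_likelihoods: list of dicts, one per frame, mapping each
        candidate phone label to its log-likelihood (hypothetical format).
    canonical_phone: the phone the learner was expected to produce.

    The score is always <= 0; values near 0 mean the canonical phone
    was the best-scoring phone in most frames.
    """
    if not frame_log_likelihoods:
        raise ValueError("segment has no frames")
    total = 0.0
    for frame in frame_log_likelihoods:
        # Log ratio of the canonical phone to the best competing phone.
        total += frame[canonical_phone] - max(frame.values())
    return total / len(frame_log_likelihoods)


def is_mispronounced(gop, threshold=-1.0):
    """Flag a segment when its GOP score is below an assumed threshold."""
    return gop < threshold


# Two frames of made-up likelihoods over three candidate phones.
frames = [
    {"ae": math.log(0.7), "eh": math.log(0.2), "ah": math.log(0.1)},
    {"ae": math.log(0.2), "eh": math.log(0.6), "ah": math.log(0.2)},
]
gop_ae = goodness_of_pronunciation(frames, "ae")
flagged = is_mispronounced(gop_ae, threshold=-0.5)
```

In practice the threshold would be tuned on held-out data, often per phone, and a classification-based score could be combined with the GOP value as an additional input; the abstract does not specify how that combination is done.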

Parallel Keywords

computer assisted pronunciation training; mispronunciation detection; mispronunciation diagnosis; acoustic models; deep neural networks; multi-task learning; prosodic features
