
英語語音評分的研究和實作

Research and Implementation of English Speech Scoring

Advisor: Jyh-Shing Roger Jang (張智星)

Abstract


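This thesis investigates computer-assisted pronunciation training (CAPT), focusing on ways to improve pronunciation scoring. We run several experiments to study the factors that affect scoring performance, such as the choice of acoustic model and differences in utterance pre-processing. We also propose adding pronunciation duration information, namely the number of frames a pronunciation spans, which helps improve mispronunciation detection. In our implementation, an acoustic model trained with a time delay neural network (TDNN) is combined with forced alignment and either Goodness of Pronunciation (GOP) or Rank Ratio (RR) to output a phonetic score. The duration information in the training data is used to build a duration model, which outputs a penalty score and a duration score. At scoring time, the system assigns a pronunciation score based on the phonetic score, the penalty score, and the duration score. The test corpora are the English Reference Words (ERW) corpus, which we collected ourselves, and the MIR-SD (MIRLab Stress Detection) corpus collected by MIRLab. In terms of equal error rate (EER), combining the phonetic score with the duration score outperforms using the phonetic score alone.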
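The abstract names GOP, Rank Ratio, and a per-phone duration model but does not give their formulas, so the following is only a minimal sketch of one plausible reading. It assumes that forced alignment has already assigned frames to a phone, that each frame is a dictionary mapping phone labels to posterior probabilities, and that duration is modelled as a Gaussian over frame counts. All function names, the `FLOOR` constant, and the Gaussian choice are assumptions of this sketch, not the thesis's definitions.

```python
import math
from statistics import mean, pstdev

FLOOR = 1e-10  # assumed floor that avoids log(0) and division by zero


def avg_log_posteriors(frames):
    """Average log posterior of every phone over one aligned segment."""
    return {
        q: sum(math.log(max(f[q], FLOOR)) for f in frames) / len(frames)
        for q in frames[0]
    }


def gop(frames, phone):
    """Posterior-based GOP variant: canonical phone's average log posterior
    minus that of the best-scoring phone (0 is best, negative is worse)."""
    avg = avg_log_posteriors(frames)
    return avg[phone] - max(avg.values())


def rank_ratio(frames, phone):
    """Rank of the canonical phone among all phones, scaled to [0, 1]
    (0 = ranked first). Requires at least two phones."""
    avg = avg_log_posteriors(frames)
    ranked = sorted(avg, key=avg.get, reverse=True)
    return ranked.index(phone) / (len(ranked) - 1)


def fit_duration_model(frame_counts):
    """Per-phone mean and standard deviation of training frame counts."""
    return {p: (mean(c), pstdev(c)) for p, c in frame_counts.items()}


def duration_score(model, phone, n_frames):
    """Score in (0, 1]: 1 at the mean duration, decaying for atypical ones."""
    mu, sigma = model[phone]
    z = (n_frames - mu) / max(sigma, FLOOR)
    return math.exp(-0.5 * z * z)


# Toy segment: two frames aligned to the canonical phone "AE".
frames = [
    {"AE": 0.7, "EH": 0.2, "IH": 0.1},
    {"AE": 0.6, "EH": 0.3, "IH": 0.1},
]
model = fit_duration_model({"AE": [8, 10, 12]})
print(gop(frames, "AE"), rank_ratio(frames, "EH"))
print(duration_score(model, "AE", 11))
```

In this sketch a segment whose frame count is far from the training mean receives a duration score near zero, which is one way duration evidence could flag a mispronunciation that the phonetic score alone misses. How the thesis derives its penalty score and combines the three scores is not stated in the abstract, so that step is left out.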

Parallel Abstract


This thesis presents our research on computer-assisted pronunciation training (CAPT) and focuses on how to improve the reliability of pronunciation scoring. We conducted several experiments to study the factors that influence scoring performance, such as the choice of acoustic model and differences in utterance pre-processing. In addition, we propose using pronunciation duration, namely the number of frames a pronunciation spans, to help improve mispronunciation detection. We train an acoustic model based on a time delay neural network (TDNN) and compute the phonetic score from forced alignment and either Goodness of Pronunciation (GOP) or Rank Ratio (RR). We also build a duration model for each phone from its durations in the training data. The duration model generates penalty scores and duration scores that adjust the original phonetic (timbre) scores. At test time, the system returns a pronunciation score based on the phonetic score, the penalty score, and the duration score. The test corpora are English Reference Words (ERW), which we collected ourselves, and MIRLab Stress Detection (MIR-SD), collected by MIRLab. The experimental results show that combining the phonetic score with the duration score achieves a lower equal error rate (EER) than using the phonetic score alone.
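Equal error rate is the operating point at which the false acceptance rate (mispronunciations accepted as correct) equals the false rejection rate (correct pronunciations rejected). The sketch below shows one simple way to estimate it by sweeping thresholds over the observed scores. The function name, the convention that higher scores mean better pronunciation, and the toy data are assumptions for illustration and are not results from the thesis.

```python
def equal_error_rate(scores, labels):
    """Estimate the EER by sweeping a threshold over the observed scores.

    scores: higher means "more likely pronounced correctly" (assumed).
    labels: True for correct pronunciations, False for mispronunciations.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    correct = [s for s, ok in zip(scores, labels) if ok]
    wrong = [s for s, ok in zip(scores, labels) if not ok]
    best = None
    for t in sorted(set(scores)) + [float("inf")]:
        # A segment is accepted when its score reaches the threshold.
        far = sum(s >= t for s in wrong) / len(wrong)
        frr = sum(s < t for s in correct) / len(correct)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy data only: eight scored segments with human correctness labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, True, False, True, False, False, False]
eer, threshold = equal_error_rate(scores, labels)
print(eer, threshold)
```

Comparing two scoring schemes under this metric amounts to calling the function once per scheme on the same labelled test set and preferring the lower value.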

