
多編碼器端到端模型於英語錯誤發音檢測與診斷

Multi-Encoder based End-to-End Model for English Mispronunciation Detection and Diagnosis

Advisor: 陳柏琳

Abstract

As globalization accelerates, most people need to learn a second language (L2), yet the number of language teachers has not kept pace with the demand for language learning. A growing body of research therefore focuses on computer-assisted pronunciation training (CAPT), which uses computers to make learning more convenient and effective. The most important module in CAPT is mispronunciation detection and diagnosis (MD&D), which uses automatic speech recognition (ASR) as its core technology. Existing MD&D models, however, still face two problems. The first is task mismatch: a pure ASR task does not make full use of the text prompt during training. The second is accent diversity: L2 learners have distinctive pronunciation habits, and the acoustic or linguistic characteristics of these habits make recognition harder and degrade model performance. To address these two problems, this research proposes two improvements to the end-to-end MD&D (E2E MD&D) model. First, we augment the input with text prompts at two granularities (phoneme and character), making E2E ASR better suited to the MD&D task. Second, we design two accent-aware modules that take opposite approaches, one supplying accent information to the model and the other removing it, in an attempt to reduce the impact of accent diversity on the E2E MD&D system. Experimental results on the public L2 corpus L2-ARCTIC show that the proposed MD&D model is effective and offers clear advantages.

