Methods and Comparisons for Improving Speech Assessment Using Deep Learning

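Whether an utterance is pronounced correctly is an important part of communication and is closely tied to the intended meaning: similar but distinct pronunciations may carry different meanings. Pronunciation accuracy therefore holds an important place in language learning. This thesis has two parts: using neural network models to classify phonemes, and using the classification results of those models to score English utterances. Together they form a neural-network-based English speech assessment system intended for computer-assisted language learning. For the neural network and deep learning part, this thesis compares MFCC features and filter-bank features in deep learning and tests many combinations of neural network parameters. After finding a parameter combination that suits the training data set, further experiments are run with large-dimension features. The best final result is obtained with large-dimension MFCC features, for which the phoneme recognition rate of the neural network model reaches 73.33%. For the speech assessment part, this thesis takes an HMM-GMM-based speech assessment system as the baseline for comparison and improvement, and proposes the max-gap and adaptive-k scoring methods, which use the output of the neural network model to score speech. The assessment test results show that adaptive-k performs better than the HMM-GMM-based system on short sentences but still needs improvement on long sentences; overall, adaptive-k still improves on the HMM-GMM-based system.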

Keywords

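neural network; speech assessment; pronunciation scoring; computer-assisted language learning; computer-assisted pronunciation training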

Parallel Abstract

Pronunciation plays an important role in communication. Similar but different pronunciations may lead to different meanings, so correct pronunciation is an important part of language learning. This thesis is divided into two parts. The first part describes the use of deep neural networks (DNN) to classify phonemes. The second part explains how the DNN output can be used to perform speech assessment. Building a DNN-based speech assessment system is the main goal of this thesis. For the DNN, we compared MFCC features with Mel filter-bank coefficients and tried a number of DNN configurations in order to find the best setting. Our main finding is that large-dimension features can give better accuracy. In our experiments, the best recognition rate of the DNN models was as high as 73.33% using large-dimension MFCC features. For speech assessment, we proposed two methods, max-gap and adaptive-k, that use the DNN’s output for speech assessment. A conventional HMM-GMM-based speech assessment system is regarded as the baseline. Our experiments demonstrate that adaptive-k outperforms HMM-GMM for short-sentence assessment. For long sentences, adaptive-k and HMM-GMM have comparable performance. In general, adaptive-k is still better than HMM-GMM for speech assessment.

Parallel Keywords

neural network; speech assessment; pronunciation scoring; computer assisted language learning (CALL); computer assisted pronunciation training (CAPT)
