This dissertation first presents a set of algorithms for automated Mandarin pronunciation learning, together with a prototype demonstration system. The system uses forced alignment with hidden Markov models to segment each phoneme and computes the log probability of the corresponding acoustic model for a ranking-based confidence measure. The pitch data of each monosyllable is then modeled with Gaussian mixture models for tone recognition. We also compute similarity scores for intensity and rhythm between the target and test utterances. The four scoring functions for phoneme, tone, intensity, and rhythm are all expressed as parametric functions, and the final overall score is a linear combination of these four. Since the overall score involves both linear and nonlinear parameters, we use the downhill simplex search to fine-tune these parameters to approximate human subjective scores. Experimental results show that the system's scores are highly consistent with human subjective evaluation.

Furthermore, this study of pronunciation learning shows that tone is fundamental and important to the pronunciation and recognition of tonal languages: whether tones are recognized correctly greatly affects the quality of pronunciation assessment. We therefore also propose an innovative method to improve tone recognition. Most previous work on tone recognition adopts two-stage processing: an acoustic model first segments the utterance into syllables via forced alignment, and then classifiers such as neural networks, Gaussian mixture models, hidden Markov models, and support vector machines are trained on the segmented syllables as tone models. However, forced alignment does not guarantee phoneme boundaries as accurate as human judgment, so the performance of tone models may degrade due to poor determination of voiced regions. To reduce the impact of this problem, we propose a robust HMM-based method for continuous-speech tone recognition, called TRUES (tone recognition using extended segments). This method extracts time-domain AMDF (average magnitude difference function) features from the whole utterance and then, via dynamic-programming optimization, extracts a continuous, unbroken pitch contour for the entire sentence. The pitch contour of each syllable is extended on both sides to train left- and right-context-dependent tone models, so as to increase the useful tonal features and the models' discriminability, and to reduce the impact of segmentation errors on the tone models. Experimental results indicate that on our self-recorded Tang-poetry corpus, the proposed TRUES achieves a 49.13% relative error rate reduction over the supratone model newly proposed in 2007; in our tests, the supratone model already outperforms other recent related work. This encouraging result demonstrates the robustness and effectiveness of the proposed TRUES, as well as the advantages of the proposed dynamic-programming-based unbroken pitch-tracking method.
This dissertation first presents the algorithms used in a prototype software system for automatic pronunciation assessment of Mandarin Chinese. The system uses forced alignment with hidden Markov models (HMMs) to identify each syllable and computes the corresponding log probability for phoneme assessment via a ranking-based confidence measure. The pitch vector of each syllable is then sent to a Gaussian mixture model (GMM) for tone recognition and assessment. We also compute similarity scores for intensity and rhythm between the target and test utterances. All four scores for phoneme, tone, intensity, and rhythm are parametric functions with certain free parameters, and the overall scoring function is formulated as a linear combination of these four scoring functions. Since the overall scoring function involves both linear and nonlinear parameters, we employ the downhill simplex search to fine-tune these parameters in order to approximate the scores given by a human expert. The experimental results demonstrate that the system gives consistent scores that are close to those of a human's subjective evaluation. Moreover, this study shows that tone recognition is a basic but important criterion for speech recognition and assessment of tonal languages such as Mandarin Chinese. Most previously proposed approaches adopt a two-step procedure: syllables within an utterance are first identified via forced alignment, and tone recognition is then performed on each segmented syllable using classifiers such as neural networks, GMMs, HMMs, and support vector machines (SVMs). However, forced alignment does not always generate accurate syllable boundaries, leading to unstable voiced/unvoiced detection and degraded tone-recognition performance.
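The weight-tuning step described above can be illustrated with a minimal sketch. This is not the dissertation's implementation: the component scores and human scores below are synthetic, and the four weights are fit with SciPy's Nelder-Mead (downhill simplex) routine, which assumes a simple mean-squared-error objective.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: per-utterance component scores for
# (phoneme, tone, intensity, rhythm) and matching human expert scores.
rng = np.random.default_rng(0)
component_scores = rng.uniform(0, 100, size=(20, 4))
hidden_weights = np.array([0.4, 0.3, 0.2, 0.1])      # unknown to the optimizer
human_scores = component_scores @ hidden_weights + rng.normal(0, 1.0, size=20)

def mse(w):
    # Mean squared error between the combined machine score and human scores.
    predicted = component_scores @ w
    return np.mean((predicted - human_scores) ** 2)

w0 = np.full(4, 0.25)                                 # start from equal weights
result = minimize(mse, w0, method="Nelder-Mead")      # downhill simplex search
```

After optimization, `result.x` holds the tuned linear weights and `result.fun` the remaining disagreement with the human scores; in the dissertation's setting the objective would also cover the nonlinear parameters inside each component scoring function.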
Aiming to alleviate this problem, we propose a robust approach called TRUES (tone recognition using extended segments) for HMM-based continuous tone recognition. The proposed approach extracts an unbroken pitch contour from a given utterance by applying dynamic programming over the time-domain average magnitude difference function (AMDF). The pitch contour of each syllable is then extended for tri-tone HMM modeling, so that the influence of inaccurate syllable boundaries is lessened. Our experimental results demonstrate that the proposed TRUES achieves a 49.13% relative error rate reduction over the recently proposed supratone model, which is deemed the state of the art in tone recognition and outperforms several previously proposed approaches. This encouraging improvement demonstrates the effectiveness and robustness of the proposed TRUES, as well as of the corresponding pitch determination algorithm, which produces unbroken pitch contours.
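The AMDF core of the pitch tracker can be sketched as follows. This is only an illustration of the per-frame computation on a synthetic sine wave, with made-up frame and pitch-range parameters; the full TRUES front end additionally runs dynamic programming across frames to select an unbroken, smoothly varying contour for the whole utterance.

```python
import numpy as np

def amdf(frame, tau):
    """Average magnitude difference function of a frame at lag tau."""
    n = len(frame) - tau
    return np.mean(np.abs(frame[:n] - frame[tau:tau + n]))

def pitch_from_frame(frame, fs, fmin=80.0, fmax=400.0):
    """Pick the lag minimizing the AMDF within a plausible pitch range."""
    taus = np.arange(int(fs / fmax), int(fs / fmin) + 1)
    d = np.array([amdf(frame, t) for t in taus])
    return fs / taus[np.argmin(d)]          # AMDF dips at the pitch period

fs = 16000
t = np.arange(int(0.04 * fs)) / fs          # one 40 ms analysis frame
frame = np.sin(2 * np.pi * 200.0 * t)       # synthetic 200 Hz voiced signal
f0 = pitch_from_frame(frame, fs)            # close to 200 Hz
```

In continuous speech the per-frame minimum is unreliable (octave errors, unvoiced regions), which is exactly why the proposed method optimizes the contour globally with dynamic programming instead of trusting each frame independently.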