In this thesis, our Mandarin learning method trains learners progressively, using speech as the primary modality and lip-shape images as a supplement. We first capture speech and lip-shape images synchronously to build two databases, each subdivided into three groups: standard, fair, and poor speech. On the speech side, linear predictive analysis and cepstral analysis are used to extract linear prediction coefficients (LPC), line spectrum pair (LSP) coefficients, and mel-frequency cepstral coefficients (MFCC) as voiceprint parameters; in addition, the pitch contour and energy curve are extracted to represent the tone and intensity of the speech, respectively. On the image side, region growing, morphology, RGB vector-space segmentation, and ellipse fitting are used to extract the height and width of the lips as image parameters. Dynamic time warping (DTW) is then used to compute the differences between standard speech and other speech in LPC, LSP, MFCC, pitch contour, and energy curve, and, combined with fuzzy theory, radial basis function neural networks (RBFNN), and probabilistic neural networks (PNN), these differences yield a set of rules for judging the learner's proficiency. DTW is likewise applied to the height and width of the learner's lip shape versus the standard lip shape, so that the system can point out what the learner should improve and make the learning interaction as effective as possible. The experiments show that, among the three voiceprint parameters LPC, LSP, and MFCC, MFCC distinguishes good from poor speech best, with an accuracy of about 84%. Adding the pitch contour and energy curve raises this accuracy markedly: using MFCC, pitch contour, and energy curve as parameters with DTW and a PNN classifier performs best, reaching 90%. Finally, ROC curves are used in two stages to evaluate the feasibility of the overall method.
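The lip-shape parameters above reduce to the height and width of the segmented lip region. A minimal sketch, assuming segmentation (region growing, morphology, RGB vector-space thresholding) has already produced a binary mask; the function name and the toy mask are illustrative, not from the thesis:

```python
# Hypothetical helper: measure the height and width of the lip region
# in a binary mask (1 = lip pixel), the two image parameters used here.
def lip_height_width(mask):
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return 0, 0
    # Bounding-box extent of the lip pixels, in pixels.
    return max(rows) - min(rows) + 1, max(cols) - min(cols) + 1

# Made-up 4x5 mask standing in for a segmented mouth image.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(lip_height_width(mask))  # (2, 3)
```

In practice these measurements would be taken per video frame, giving a height/width sequence that can be compared against the standard database.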
A Chinese-learning assistance system based on speech recognition and lip-shape image processing is proposed in this thesis. A test database of synchronized speech signals and lip-shape images was built; it contains three types of audio-video pairs: good, fair, and unqualified speech and lip shapes. During the learning process, the system first plays a demonstration speech and video, then records the learner's repeated speech and a video sequence of the mouth, analyzes and evaluates the learner's utterance, and, if the evaluation is graded poorly, indicates the correct lip movement and pronunciation and prompts the learner to practice again. For speech analysis, linear prediction coefficients (LPC), line spectrum pairs (LSP), and mel-frequency cepstral coefficients (MFCC) were examined as voiceprint parameters. In addition, the pitch contour and energy curve were adopted as the parameters of tone and intensity of the speech signal, respectively. For lip-shape analysis, the height and width of the lips were used as parameters. In the scoring stage, the dynamic time warping (DTW) algorithm combined with fuzzy theory, radial basis function neural networks (RBFNN), and probabilistic neural networks (PNN) was applied to determine whether a test utterance was qualified. A DTW comparison between the standard database and an unqualified utterance was introduced to quantitatively prompt the user on how to correct the lip shape. In the experiments, MFCC proved the best of the three voiceprint parameters, achieving 84% accuracy with DTW processing and PNN classification. Combining the MFCC, pitch contour, and energy curve parameters of the speech signal raised the classification accuracy further, up to 90%.
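The scoring stage rests on DTW, which aligns two feature sequences of different lengths before accumulating their distance. A minimal sketch of the classic DTW recurrence, assuming each utterance has already been reduced to a sequence of feature vectors (e.g. MFCC frames); the function names and toy sequences are illustrative, not from the thesis:

```python
# Minimal dynamic time warping (DTW) sketch over feature-vector sequences.
import math

def euclidean(a, b):
    """Frame-to-frame distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(ref, test):
    """Accumulated DTW cost between a reference and a test sequence."""
    n, m = len(ref), len(test)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(ref[i - 1], test[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

# Toy example: an utterance compared with itself aligns perfectly.
ref = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
print(dtw_distance(ref, ref))  # 0.0
```

In the system described here, such a DTW cost would then be fed to the fuzzy, RBFNN, or PNN stage to grade the utterance.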
Finally, the receiver operating characteristic (ROC) curve was introduced to quantitatively evaluate the sensitivity and specificity of the proposed algorithm.
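An ROC curve is traced by sweeping a decision threshold over the classifier's scores and recording, at each threshold, the false-positive rate (1 − specificity) against the true-positive rate (sensitivity). A minimal sketch; the scores and labels below are made up, not the thesis data:

```python
# Illustrative ROC computation: sweep a threshold over classifier scores
# and collect (false-positive-rate, true-positive-rate) points.
def roc_points(scores, labels):
    pos = sum(labels)             # number of positive (qualified) samples
    neg = len(labels) - pos       # number of negative samples
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.55, 0.4, 0.3]   # hypothetical classifier scores
labels = [1,   1,   0,    1,   0]     # 1 = qualified, 0 = unqualified
print(roc_points(scores, labels))
```

Plotting these points gives the ROC curve; a curve bowed toward the top-left corner indicates that the grader separates qualified from unqualified utterances well.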