Automatic Chinese Character Error Detecting System Based on N-gram Language Model and Pragmatics Knowledge Base

Advisor: 吳世弘


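Although methods and devices for automatically correcting Chinese character errors already exist, they still have shortcomings worth addressing. Their algorithms are time-consuming, which increases the system's computational load; they suffer from false alarms; their output only corrects the error without offering appropriate suggestions or illustrative examples; and they have rarely been applied in computer-assisted instruction. We therefore propose a method that combines an N-gram language model with a knowledge base of positive and negative corpus examples, and uses information retrieval techniques to improve performance. On this basis we develop a workflow and system for automatic error detection in Chinese essays that differs from earlier work. It detects and suggests corrections for three error types: similar-shape characters, homophones, and idioms. It detects errors in student essays faster and more accurately and, most importantly, gives appropriate suggestions, with the aim of helping students improve their writing.

The N-gram language model on which the system relies computes the probability of character and word combinations, and a combination with a higher probability is taken to be more likely correct. The quality of a language model depends heavily on a large training corpus, so it has weaknesses to overcome. One is data sparseness, which can be addressed with smoothing. Another is domain mismatch: the more the training corpus differs in nature from the test essays, the worse the resulting language model performs, so the corpus must be changed and adapted accordingly. In addition, the knowledge base of positive and negative examples helps the system first flag characters that may be wrong, which reduces the language-model computation and improves system performance.

The experiments test Chinese texts from a variety of sources, including artificially constructed data and real essays from a junior high school in Taipei City, to observe and analyze the system's ability to judge ill-formed Chinese sentences. We evaluate the system by recall and precision on word errors and idiom errors, present an error analysis of the results, and use a questionnaire to analyze how the system can help junior high school students. The results show that the proposed system performs well in detecting word and idiom errors and also provides appropriate suggestions and explanations. The system can thus be offered to students for writing practice, improving their language ability, while also assisting teachers in their teaching.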
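The scoring idea described above, where a higher N-gram probability is taken as evidence of correct usage, can be illustrated with a minimal sketch. It is not the system's actual implementation. It assumes a character bigram model with add-k smoothing (one of many possible smoothing methods) and a tiny hand-written confusion table; the names `BigramModel`, `detect`, and `confusion`, as well as the toy corpus, are hypothetical.

```python
from collections import Counter
import math


class BigramModel:
    """Character bigram model with add-k smoothing (an assumed smoothing choice)."""

    def __init__(self, sentences, k=1.0):
        self.k = k
        self.context = Counter()  # how often each character appears as a left context
        self.pairs = Counter()    # how often each adjacent character pair appears
        for s in sentences:
            chars = ["<s>"] + list(s) + ["</s>"]
            self.context.update(chars[:-1])
            self.pairs.update(zip(chars, chars[1:]))
        self.vocab = set(self.context) | {"</s>"}

    def log_prob(self, sentence):
        """Smoothed log-probability of a sentence; unseen pairs get a small nonzero mass."""
        chars = ["<s>"] + list(sentence) + ["</s>"]
        v = len(self.vocab)
        total = 0.0
        for a, b in zip(chars, chars[1:]):
            total += math.log((self.pairs[(a, b)] + self.k) / (self.context[a] + self.k * v))
        return total


def detect(sentence, model, confusion):
    """Return (position, original, suggestion) for each substitution that scores higher."""
    findings = []
    base = model.log_prob(sentence)
    for i, ch in enumerate(sentence):
        best, best_score = None, base
        # Only characters listed in the (hypothetical) confusion table are re-scored,
        # which keeps the number of language-model evaluations small.
        for cand in confusion.get(ch, ()):
            alt = sentence[:i] + cand + sentence[i + 1:]
            score = model.log_prob(alt)
            if score > best_score:
                best, best_score = cand, score
        if best is not None:
            findings.append((i, ch, best))
    return findings


corpus = ["我在家", "他在學校", "再見"]        # toy training data
confusion = {"再": ["在"], "在": ["再"]}       # toy homophone pairs
model = BigramModel(corpus)
print(detect("我再家", model, confusion))      # [(1, '再', '在')]
```

Restricting re-scoring to characters found in the confusion table mirrors, in a simplified way, the role the abstract gives to the knowledge base: flag likely errors first so that the language model is consulted less often.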


Essay error detection is an important function in computer-aided essay composition. Systems that can detect spelling errors and usage errors are very helpful for students. Previous systems, which relied on a confusion set for each Chinese character, tended to give false alarms and did not explain the errors. To overcome these drawbacks, we implement an error detection system for Chinese essays based on statistical methods and a knowledge base. It labels errors and gives suggestions. Previous work focused on all possible errors arising from characters with similar shape or pronunciation. In addition to the common error patterns, we collect a corpus of correct usages such as idioms, maxims, and slang, which provides context for potential errors. Our system makes its decisions with an n-gram language model; once a word is labeled as an error, the system gives an explanation based on the correct context. Thus, our system can offer students information to improve their essays. Traditionally, there are two difficulties in applying a language model: data sparseness and data adaptability. To deal with data sparseness in the N-gram language model, we adopt several smoothing methods in our system. To improve adaptability, our system combines two language models to fit students' usage. With a large knowledge base that contains thousands of common error patterns, our system can better identify error candidates. In the experiments, simulated data and a corpus of real essays are used. We report the recall and precision of our system, give an error analysis, and assess the possible benefits of the system. We believe the system can help students and teachers not only in class but also in distance learning via the Internet.
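The abstract does not say how the two language models are combined. One common option is linear interpolation, sketched below purely as an assumption. The function names `make_bigram_prob` and `interpolate`, the weight `lam`, and the two toy corpora are all hypothetical.

```python
from collections import Counter


def make_bigram_prob(sentences, vocab, k=1.0):
    """Return p(b | a) with add-k smoothing over a shared vocabulary."""
    context = Counter()
    pairs = Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        context.update(chars[:-1])
        pairs.update(zip(chars, chars[1:]))
    size = len(vocab)

    def prob(a, b):
        return (pairs[(a, b)] + k) / (context[a] + k * size)

    return prob


def interpolate(p_general, p_domain, lam=0.5):
    """Mix two models: lam weights the in-domain model, (1 - lam) the general one."""
    def prob(a, b):
        return lam * p_domain(a, b) + (1.0 - lam) * p_general(a, b)

    return prob


general = ["今天天氣很好", "他在公司開會"]     # stand-in for a large general corpus
student = ["我在學校寫作文", "老師在教室"]     # stand-in for student essays
vocab = set("".join(general + student)) | {"<s>", "</s>"}

p_general = make_bigram_prob(general, vocab)
p_student = make_bigram_prob(student, vocab)
p_mixed = interpolate(p_general, p_student, lam=0.7)

print(p_mixed("在", "學"))
```

Because both component models are smoothed over the same vocabulary, the mixture is still a proper conditional distribution. In practice the weight `lam` would be tuned on held-out essays from the target domain.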


