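According to current trends, Chinese has become the second most widely used language in the world, and a growing number of foreigners are learning it. Learners who have not yet mastered Chinese grammar often make errors in the sentences they write, and this system was developed to help them locate those errors quickly. Using Chinese sentence error diagnosis as a case study, this thesis shows how machine learning algorithms can be used to find errors in learners' sentences and identify the type of each error. A sentence is made up of many words; we analyze combinations of content-word, part-of-speech (POS), and collocation features to determine whether a sentence contains a redundant word, a word selection error, a word ordering error, or a missing word. We propose an automatic Chinese grammatical error detection system based on conditional random fields (CRF). The system has two parts: a collocation collection module and a training and testing module. Results on the NLP-TEA2 and NLP-TEA3 tasks show that a CRF-based automatic sentence error detection system can achieve reasonably good accuracy and precision.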
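To make the feature combination concrete, the following is a minimal sketch of how word, POS, and collocation features might be assembled per token before being passed to a CRF tagger. It is not the system's actual implementation: the function names, the feature names, the POS tags, and the tiny collocation set are all assumptions made for illustration, and the CRF training step itself is omitted.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]          # (word, POS tag); tag set is hypothetical
Collocations = Set[Tuple[str, str]]  # adjacent word pairs assumed to collocate


def token_features(sent: List[Token], i: int, colloc: Collocations) -> Dict[str, object]:
    """Build a feature dict for token i from word, POS, and collocation cues."""
    word, pos = sent[i]
    feats: Dict[str, object] = {"word": word, "pos": pos}
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        feats["prev_word"] = prev_word
        feats["prev_pos"] = prev_pos
        # Does the previous word form a known collocation with this word?
        feats["colloc_prev"] = (prev_word, word) in colloc
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        feats["next_word"] = next_word
        feats["next_pos"] = next_pos
        # Does this word form a known collocation with the next word?
        feats["colloc_next"] = (word, next_word) in colloc
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(sent: List[Token], colloc: Collocations) -> List[Dict[str, object]]:
    """One feature dict per token, the usual input shape for CRF toolkits."""
    return [token_features(sent, i, colloc) for i in range(len(sent))]


# Illustrative learner sentence with a redundant word (second token).
# The segmentation, POS tags, and collocation set are made up for this sketch.
sentence = [("我", "Nh"), ("是", "SHI"), ("很", "Dfa"), ("高興", "VH")]
collocations = {("很", "高興")}
features = sentence_features(sentence, collocations)
```

In a full pipeline, each feature list would be paired with a per-token label sequence (for example, a tag marking the redundant token and a neutral tag elsewhere), and those pairs would be used to train the CRF.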
World Internet statistics reveal that Chinese is the second most used language among Internet users worldwide. More and more foreigners are learning Chinese. Second language (L2) learners often have difficulty using word collocations appropriately. This thesis constructs an error diagnosis system for Chinese as a second language that diagnoses grammatical errors and helps learners write better. The system detects redundant word, missing word, word selection, and word ordering errors in an input sentence. Based on the conditional random fields (CRF) model and a combination of word, POS, and collocation features, our system trains a linear tagger that detects grammatical errors. In NLP-TEA2 and NLP-TEA3 shared tasks, the system