A Chinese Grammatical Error Diagnosis System for Learners of Composition

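Chinese has become the second most widely used language in the world, and a growing number of foreigners are learning it. Learners who are not yet familiar with Chinese grammar often write sentences that contain errors, and this system was developed to help them find those errors quickly. Taking error diagnosis of Chinese sentences as a case study, this thesis describes how machine learning algorithms can be used to locate errors in learners' sentences and identify their types. A sentence is made up of many words; we analyze combinations of content-word, part-of-speech (POS), and collocation features to decide whether a sentence contains a redundant word, a word selection error, a word ordering error, or a missing word. We propose an automatic Chinese grammatical error detection system based on conditional random fields (CRF). The system has two parts: a collocation collection module and a training and testing module. Results on the NLP-TEA2 and NLP-TEA3 tasks show that the CRF-based detection system achieves reasonable accuracy and precision.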
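The feature combination described above can be sketched as a per-token feature extractor of the kind usually fed to a CRF toolkit. This is only an illustrative sketch, not the thesis's actual feature template: the tag names (`O`, `R`, `M`, `S`, `W`), the feature names, the POS tags, the toy sentence, and the collocation list are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical tag set mirroring the four error types in the abstract:
# R = redundant word, M = missing word, S = word selection,
# W = word ordering, O = no error.
TAGS = ("O", "R", "M", "S", "W")

Token = Tuple[str, str]  # (word, POS tag)


def token_features(sent: List[Token], i: int,
                   collocations: Set[Tuple[str, str]]) -> Dict[str, object]:
    """Build a feature dict for token i from words, POS tags and collocations."""
    word, pos = sent[i]
    feats: Dict[str, object] = {"word": word, "pos": pos}
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        feats["prev_word"] = prev_word
        feats["prev_pos"] = prev_pos
        # True when (previous word, current word) is a known collocation pair.
        feats["colloc_prev"] = (prev_word, word) in collocations
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        feats["next_word"] = next_word
        feats["next_pos"] = next_pos
        # True when (current word, next word) is a known collocation pair.
        feats["colloc_next"] = (word, next_word) in collocations
    else:
        feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(sent: List[Token],
                      collocations: Set[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Feature dicts for every token of one segmented, POS-tagged sentence."""
    return [token_features(sent, i, collocations) for i in range(len(sent))]


# Toy input; the segmentation, POS tags and collocation list are made up.
sentence = [("我", "Nh"), ("喝", "VC"), ("茶", "Na")]
known_collocations = {("喝", "茶")}
features = sentence_features(sentence, known_collocations)
```

In a setup like this, each sentence becomes a list of feature dicts paired with a list of tags of equal length, and the collocation set would come from something like the collocation collection module.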

Keywords

Conditional random fields; grammar detection; collocation; natural language processing

Parallel Abstract

Internet usage statistics indicate that Chinese is the second most used language on the Internet, and more and more foreigners are learning it. Second language (L2) learners often have trouble using word collocations appropriately. This thesis builds an error diagnosis system for Chinese as a second language that diagnoses grammatical errors and helps learners write better. The system detects redundant word, missing word, word selection, and word ordering errors in an input sentence. Based on the conditional random fields (CRF) model and a combination of word, POS, and collocation features, the system trains a linear tagger that detects grammatical errors. In the NLP-TEA2 and NLP-TEA3 shared tasks, the system achieved reasonable accuracy and precision.
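For readers unfamiliar with the model, the standard linear-chain CRF can be written as below. This is the generic textbook form, given as background only; the symbols are not taken from the thesis, and its actual feature functions are not specified here. Here x is the input sentence of length T, y the tag sequence, f_k a feature function, and λ_k its learned weight.

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```

Tagging a sentence then amounts to choosing the tag sequence y that maximizes this probability, typically with the Viterbi algorithm.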

Parallel Keywords

CRF; Natural Language Processing; Collocation; Grammatical error diagnosis

