
非中文母語學習者中文寫作用詞錯誤偵測及更正之研究

Detection and Correction of Chinese Word Usage Errors for Learning Chinese as a Second Language

Advisor: 陳信希

Abstract

In recent years, more and more people around the world have chosen to learn Chinese as a second language, and the need for Chinese grammatical error detection and correction tools has grown accordingly. In the HSK dynamic composition corpus, word usage errors (WUEs) are the most common type of error at the lexical level. However, few studies focus on WUE detection, and work on correction has addressed only specific word classes, such as prepositions. In this master's thesis, we propose methods to detect and correct Chinese WUEs. To the best of our knowledge, this is the first study to address Chinese WUE correction for all word classes.

We handle Chinese WUEs in three stages: (1) segment-level detection, (2) token-level detection, and (3) correction. Character, word, part-of-speech (POS), and dependency information is used. In the first stage, we train binary classifiers to decide whether a segment is correct or contains a WUE. The best model achieves an accuracy of 0.84 and a precision of 0.95. In the second stage, we use a bidirectional Long Short-Term Memory (LSTM) network to build a sequence labeling model that predicts the level of incorrectness of each token. The model can take into account the relationship between the erroneous token and its context words, and it ranks the ground-truth position within the top two in more than 80% of the test cases. In the third stage, we build a neural network model that takes features of the context and of the target erroneous token as input and generates a correction vector, which is compared against a candidate vocabulary to select suitable corrections. Because more than one correction may be valid, human annotators judge the system's top five candidates. According to the human evaluation, for more than 91% of the test cases at least one of the top five candidates is an acceptable correction. With our system, non-native learners of Chinese can check and revise their own sentences without guidance from a language teacher.
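The third stage compares a correction vector against a candidate vocabulary and keeps the top five candidates. The abstract does not say which similarity measure or data structures are used, so the following is only a minimal sketch under stated assumptions: cosine similarity as the comparison, plain Python lists as vectors, and hypothetical names (`top_k_corrections`, `hit_rate_at_k`, `cand_a`, and so on) that do not come from the thesis. The second function illustrates the kind of "at least one acceptable correction in the top k" measure reported in the human evaluation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def top_k_corrections(correction_vec, candidates, k=5):
    """Return the k candidate words whose vectors are closest to correction_vec.

    `candidates` maps each candidate word to its vector (hypothetical layout).
    """
    ranked = sorted(
        candidates.items(),
        key=lambda item: cosine(correction_vec, item[1]),
        reverse=True,
    )
    return [word for word, _ in ranked[:k]]


def hit_rate_at_k(ranked_lists, acceptable_sets, k=5):
    """Fraction of cases with at least one acceptable word in the top k."""
    hits = sum(
        1
        for ranked, acceptable in zip(ranked_lists, acceptable_sets)
        if any(word in acceptable for word in ranked[:k])
    )
    return hits / len(ranked_lists)


# Toy data, invented purely for illustration.
toy_candidates = {
    "cand_a": [1.0, 0.0],
    "cand_b": [0.9, 0.1],
    "cand_c": [0.0, 1.0],
    "cand_d": [-1.0, 0.0],
}
best_two = top_k_corrections([1.0, 0.0], toy_candidates, k=2)
```

In this toy setup, `best_two` contains the two candidates whose vectors point most nearly in the same direction as the correction vector. A real system would use learned embeddings and a much larger candidate vocabulary.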

