Detection and Correction of Word Ordering Errors in Chinese Writing by Non-Native Learners

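Chinese has become the second most popular foreign language in the world, and the number of foreigners learning it keeps growing. A survey of prior work, however, shows that few studies specifically address computer-based correction of Chinese word ordering errors. Most related work is either contrastive analysis of word order between Chinese and other languages in linguistics, or treatment of pre-ordering and post-ordering in statistical machine translation. Some studies do address computer-based correction of Chinese grammatical errors, but they say little about word ordering errors, and most still rely on grammar rules written by linguistic experts (rule-based methods) to detect and correct errors. This study collects authentic erroneous sentences written by non-native learners from the "HSK dynamic composition corpus" (HSK動態作文語料庫) of the 漢語國際教育技術研發中心 at Beijing Language and Culture University. For sentences in the "word ordering error" category, two researchers who are native Chinese speakers annotated corrections, yielding a parallel corpus of sentences before and after correction. The experimental dataset is drawn from the HSK corpus, and training features are extracted from the Google Web Chinese 5-gram corpus. A CRF detects the segments of a sentence that may contain word ordering errors. The words inside each detected segment are then reordered to generate a set of candidate corrections that may include the correctly ordered sentence. Finally, Ranking SVM ranks the candidates to find the one most likely to be the correctly ordered version of the original sentence. Both detection and correction are handled by a supervised learning method, with no manually built grammar rules. In the detection experiment, the system identifies sentence segments containing word ordering errors with 83.4% accuracy. The correction experiment is evaluated in two settings: one uses the gold-standard detection annotations, and the other uses the system's own detection output. Assuming that every word ordering error is detected, the correctly ordered sentence ranks third on average. For the overall pipeline, which uses the system's own detection results, it ranks seventh on average.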
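The candidate generation step described above (reordering the words inside a detected segment) can be illustrated with a minimal Python sketch. It assumes exhaustive permutation of the segment, which the abstract does not confirm as the actual generation strategy. The function name `generate_candidates`, the segment boundaries, and the example sentence are all hypothetical and are not taken from the HSK corpus.

```python
from itertools import permutations


def generate_candidates(tokens, start, end):
    """Return every distinct reordering of tokens[start:end].

    Tokens outside the segment stay in place. Exhaustive permutation is
    an assumption made for illustration only.
    """
    prefix = tuple(tokens[:start])
    segment = tokens[start:end]
    suffix = tuple(tokens[end:])
    seen = set()
    candidates = []
    for perm in permutations(segment):
        candidate = prefix + perm + suffix
        if candidate not in seen:  # skip duplicates caused by repeated words
            seen.add(candidate)
            candidates.append(list(candidate))
    return candidates


# Hypothetical learner sentence; the detected segment is assumed to be tokens 1..4.
tokens = ["我", "學習", "中文", "在", "台灣"]
candidates = generate_candidates(tokens, 1, 5)
```

Because the number of permutations grows factorially with segment length, a sketch like this is only practical when the detected segment is short, which is one reason accurate segment detection matters before ranking.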

Keywords

Chinese word ordering; word ordering error detection; word ordering error correction; HSK corpus; grammatical errors

Parallel Abstract

Chinese has become the second most popular foreign language in the world, and a growing number of foreigners are learning it. However, little research has specifically addressed the detection and correction of Chinese word ordering errors. Related work in linguistics compares word order between Chinese and other languages, and work in statistical machine translation (SMT) deals with word reordering. There is also research on detecting and correcting Chinese grammatical errors by computer, but most of it depends on rule-based models built from linguistic knowledge. The topic of this research is therefore relatively new. The experimental dataset is extracted from the HSK dynamic composition corpus built by Beijing Language and Culture University. Sentences annotated with "word ordering error" tags are extracted, and two corrections for each sentence are annotated by two researchers who are native Chinese speakers. The resulting test dataset pairs each original sentence with its two corrected versions. Word n-gram features from the Google Web Chinese 5-gram corpus are used for model training. A CRF detects the segments of a sentence that may contain word ordering errors. Words in the detected segments are then reordered to generate candidates that may have the correct word ordering. To find the best candidate, Ranking SVM ranks the candidates using POS bigram and POS trigram features. The detection and correction models are trained with a supervised learning method, without building rule-based models. The experimental results show that the system achieves 83.4% accuracy in finding sentence segments with word ordering errors. The candidate with the correct word ordering ranks 7th on average when correction relies on the system's own detection output, and 3rd on average assuming that all word ordering errors are detected.
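The POS bigram and trigram features and the average-rank evaluation mentioned above can be sketched as follows. This is an illustration under assumptions, not the system's implementation: the sentence-boundary symbols `<s>` and `</s>`, the tag names, and the helper names `pos_ngram_features`, `rank_of_correct`, and `mean_rank` are hypothetical, and the candidate scores are placeholders standing in for the output of a trained Ranking SVM, which is not shown.

```python
from collections import Counter


def pos_ngram_features(pos_tags, n_values=(2, 3)):
    """Count POS n-grams over a tag sequence padded with boundary symbols."""
    padded = ["<s>"] + list(pos_tags) + ["</s>"]  # boundary symbols are an assumption
    features = Counter()
    for n in n_values:
        for i in range(len(padded) - n + 1):
            features["_".join(padded[i:i + n])] += 1
    return features


def rank_of_correct(scores, correct_index):
    """1-based rank of the correct candidate, sorting by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(correct_index) + 1


def mean_rank(ranks):
    """Average rank of the correct candidate over a set of test sentences."""
    return sum(ranks) / len(ranks)


# Hypothetical tag sequence and placeholder scores for three candidates.
features = pos_ngram_features(["N", "V", "N"])
rank = rank_of_correct([0.2, 0.9, 0.5], correct_index=2)
```

A lower mean rank means the correctly ordered sentence tends to appear nearer the top of the candidate list, which is how the reported averages of 3rd and 7th place should be read.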

Parallel Keywords

Chinese Word Ordering; Word Ordering Errors Detection; Word Ordering Errors Correction; HSK Corpus; Grammar Errors

