Chinese Spell Checking Based on Noisy Channel Model

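Automatically correcting Chinese spelling and typing errors is an important problem in word processing, Web search, and automated essay scoring. Unlike spelling correction for alphabetic languages, however, Chinese has no delimiters between words, and different Chinese input methods can produce different types of errors, which makes correction harder. This paper proposes a method based on the noisy channel model for correcting errors that arise from phonological or visual similarity. We first use a Web corpus to generate confusion sets and their associated probabilities, which together form the channel model, and then correct errors using the channel model and a language model. The system has a training stage and a run-time stage. In the training stage, we use n-gram frequencies in the Web corpus to estimate, for each character, the probability of each of its confusable characters. At run time, the system generates multiple candidate characters for the input sentence and uses the channel model and the language model to select the most suitable one. Experimental results show that a prototype system built with the proposed method achieves good correction precision and recall.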
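
For orientation, the standard noisy channel decision rule can be written as follows. Here S is the observed (possibly erroneous) character sequence and C is a candidate corrected sequence. The per-character channel factorization and the bigram language model are assumptions made for illustration; the abstract does not state the exact factorization or n-gram order used.

```latex
\hat{C} = \arg\max_{C} P(C \mid S)
        = \arg\max_{C} P(S \mid C)\, P(C)
% Assumed factorization (illustrative only):
%   channel model:   P(S \mid C) \approx \prod_{i=1}^{n} P(s_i \mid c_i)
%   language model:  P(C)        \approx \prod_{i=1}^{n} P(c_i \mid c_{i-1})
```

Under these assumptions, each candidate c_i for position i is drawn from the confusion set of the observed character s_i, and the language model decides between candidates with similar channel probabilities.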

Keywords

Noisy channel model; language model; Web corpus; confusion set

Parallel Abstract

Chinese spell checking is an important component of many Chinese NLP applications, including word processors, search engines, and automatic essay rating. Unlike English, Chinese has no word boundaries, and the various Chinese input methods cause different kinds of typos. It is therefore more difficult to develop a spell checker for Chinese. In this paper, we introduce a novel method for correcting Chinese errors caused by sound or shape similarity. In our approach, potential typos in a given sentence are corrected using a channel model and a character-based language model within the noisy channel model. In the training phase, we estimate the channel probabilities for each character based on n-grams in a Web corpus. At run time, the system generates correction candidates for each character in the given sentence and selects the appropriate correction using the channel model and the language model. The experimental results show that the proposed method achieves significantly better accuracy and recall than more complicated methods in previous work.
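
The run-time step described above (generate candidates per character, then choose with the channel model and a character-based language model) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' system: the toy `CORPUS`, the hand-written `CHANNEL` probabilities, the bigram order, the crude add-one smoothing, and the function names `train_bigram`, `candidates`, and `correct` are all assumptions introduced here.

```python
import math
from collections import Counter

# Toy text standing in for a Web corpus (hypothetical data).
CORPUS = ["今天天氣很好", "天氣預報說明天下雨", "他的脾氣很好", "明天天氣也很好"]

# Hypothetical channel model: CHANNEL[intended][typed] = P(typed | intended).
# In the paper these probabilities are estimated from Web n-gram counts;
# here they are simply written by hand.
CHANNEL = {
    "氣": {"氣": 0.8, "汽": 0.2},
    "汽": {"汽": 0.8, "氣": 0.2},
}


def train_bigram(sentences):
    """Return a function giving add-one-smoothed log P(cur | prev) over characters."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    vocab = len(unigrams)

    def logp(prev, cur):
        # Crude smoothing: unseen characters still receive a small probability.
        return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))

    return logp


def candidates(typed):
    """Return (intended, log P(typed | intended)) pairs for one typed character."""
    cands = [(c, math.log(row[typed])) for c, row in CHANNEL.items() if typed in row]
    # Characters without a confusion set are assumed to be typed correctly.
    return cands or [(typed, 0.0)]


def correct(sentence, logp):
    """Viterbi search for the sequence maximizing channel score plus LM score."""
    best = {"<s>": (0.0, "")}  # last character -> (log score, best path so far)
    for typed in sentence:
        nxt = {}
        for cand, log_channel in candidates(typed):
            score, path = max(
                (prev_score + logp(prev, cand), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            nxt[cand] = (score + log_channel, path + cand)
        best = nxt
    return max(best.values())[1]


logp = train_bigram(CORPUS)
print(correct("今天天汽很好", logp))
```

With this toy data, the bigram counts for 天氣 and 氣很 outweigh the channel's preference for leaving 汽 unchanged, so the call restores 氣. A real system would need far larger n-gram counts and confusion sets covering both sound-alike and look-alike characters.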

Parallel Keywords

Noisy Channel Model; Character-based Language Model; Web Corpus; Confusion Set
