  • Thesis


Improve the Detection of Improperly Used Chinese Characters with Noisy Channel Model and Detection Template

Advisor: 吳世弘


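Although there have been many studies and systems for detecting and correcting improperly used Chinese characters, some drawbacks remain: detection is time-consuming; false alarms are too frequent; correctly detected errors are not necessarily corrected to the right character; systems cannot offer different performance trade-offs for different user preferences; and once a detection system has been built, its knowledge base cannot be extended to keep improving performance. To address these problems, we propose a detection and correction system that combines detection templates with statistical machine translation. Both modules are built from large confusion sets and from statistics on character errors that students actually made. Using the confusion sets, we can automatically generate tens of thousands of detection templates. The noisy channel model used in statistical machine translation improves on using a language model alone. Our system can detect and suggest corrections for similar-sounding characters, similar-looking characters, and errors that students commonly make. In the experiments, we reimplemented three systems from the literature and used a single, consistent data set to compare detection and correction methods more objectively. The experiments confirm that our system effectively reduces false alarms and achieves the best F-score.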
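To illustrate why a noisy channel model can improve on a language model alone, here is a minimal Python sketch. It is not the thesis implementation. The confusion set, the bigram probabilities, the unseen-bigram floor, the per-character error rate, and all function names are hypothetical toy values chosen for illustration. A candidate sentence is scored by a language model term plus a channel term that penalizes each substitution, so a character is changed only when the language model gain outweighs the cost of assuming an error.

```python
import math

# Hypothetical toy confusion set (assumed symmetric): character -> confusable characters.
CONFUSION = {"在": ["再"], "再": ["在"]}

# Hypothetical character-bigram probabilities standing in for a trained language model.
BIGRAM_PROB = {("再", "見"): 0.2, ("現", "在"): 0.3}
UNSEEN_PROB = 1e-4   # assumed floor for bigrams not in the table
P_ERROR = 0.01       # assumed probability that a writer miswrites a character


def lm_logprob(chars):
    """Language model term: sum of log bigram probabilities."""
    return sum(math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
               for pair in zip(chars, chars[1:]))


def channel_logprob(intended, observed):
    """Channel term: log P(observed | intended) for one character."""
    if intended == observed:
        return math.log(1 - P_ERROR)
    # Spread the error mass evenly over the intended character's confusion set.
    return math.log(P_ERROR / len(CONFUSION.get(intended, [observed])))


def score(candidate, observed):
    """Noisy channel score: log P(candidate) + log P(observed | candidate)."""
    channel = sum(channel_logprob(c, o) for c, o in zip(candidate, observed))
    return lm_logprob(candidate) + channel


def correct(observed):
    """Return the best-scoring sentence, allowing at most one substitution."""
    best, best_score = observed, score(observed, observed)
    for i, ch in enumerate(observed):
        for alt in CONFUSION.get(ch, []):
            candidate = observed[:i] + alt + observed[i + 1:]
            s = score(candidate, observed)
            if s > best_score:
                best, best_score = candidate, s
    return best
```

With these toy numbers, `correct("在見")` proposes `"再見"`, while an already correct input such as `"再見"` is left unchanged, because the channel term makes every substitution costly. That cost is one plausible way a noisy channel model can reduce false alarms.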


Existing Chinese character error detection systems have five drawbacks: 1) high time complexity; 2) a high false alarm rate; 3) an inability to correct most of the erroneous characters they detect; 4) an inability to provide different modes for different users; 5) an inability to improve system performance by adding manually edited knowledge after the system has been built. To address these drawbacks, we propose a system that combines a statistical module and a template matching module to detect and correct Chinese character errors. Our system automatically generates templates with the help of a dictionary and confusion sets. The statistical method is based on the noisy channel model, which outperforms systems that use a language model alone. The training data include students' essays containing errors and a large corpus. Our system can detect and correct three types of errors: pronunciation-related errors, form-related errors, and common errors. In this paper, we compare our system with three methods proposed in previous work, testing all of them on the same data set. The experimental results show that our system significantly reduces false alarms and achieves the best F-score.
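The template generation step can be sketched as follows. This is only an illustration, assuming templates are made by substituting confusable characters into dictionary words; the abstract does not spell out the actual procedure. The dictionary, the confusion set, and the function names `generate_templates` and `detect` are hypothetical.

```python
# Hypothetical toy dictionary of correct words.
DICTIONARY = {"再見", "現在", "已經"}

# Hypothetical toy confusion set: correct character -> characters often written instead.
CONFUSION = {"再": ["在"], "在": ["再"], "已": ["以", "己"]}


def generate_templates(dictionary, confusion):
    """Map each erroneous variant of a dictionary word to its correction."""
    templates = {}
    for word in sorted(dictionary):
        for i, ch in enumerate(word):
            for wrong in confusion.get(ch, []):
                variant = word[:i] + wrong + word[i + 1:]
                if variant in dictionary:
                    continue  # the variant is itself a valid word, so it is not an error
                templates.setdefault(variant, word)
    return templates


def detect(text, templates):
    """Return (position, erroneous string, suggested correction) for each match."""
    hits = []
    for pattern, correction in templates.items():
        start = text.find(pattern)
        while start != -1:
            hits.append((start, pattern, correction))
            start = text.find(pattern, start + 1)
    return sorted(hits)
```

Because the templates live in a plain lookup table, manually edited entries could be added after the table is built, which is the kind of extensibility the abstract asks for. For example, `detect("我以經到了，在見", generate_templates(DICTIONARY, CONFUSION))` flags `以經` and `在見` and suggests `已經` and `再見`.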


