Chinese Spelling Correction Based on Hidden Markov Models

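Using revision logs provided by the United Daily News (UDN), this thesis analyzes how the errors made by journalists differ from those of general writers. We find that most revisions arise from needs specific to journalism: converting Arabic numerals to Chinese numerals to fit the newspaper layout, making sentence style more colloquial, and replacing variant characters. The last large group consists of word pairs that are extremely easy to confuse, such as "紀錄/記錄". Earlier spelling-error datasets lean toward younger writers or beginning learners of Chinese. By comparison, the UDN errors cover a broader range: not only visually or phonologically similar characters, but also many revisions of a more practical kind. From these logs we selected several thousand standard sentences as the first benchmark designed specifically for evaluating Chinese spelling-check systems aimed at professional editors. We also combine the shape and pronunciation of characters with related features to train an SVM classifier, and use that classifier to build a new confusion set (error-correction set); the resulting set greatly reduces overall search time. In the system, sentences are scored with a noisy channel model and a language model. We compare HMM with beam search and find that beam search gives better results than HMM.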
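The sentence scoring mentioned above can be written out in the usual noisy-channel form. This is a sketch of the textbook formulation, not necessarily the exact objective used in the thesis. The per-character independence of the channel model and the interpolation weight \lambda are assumptions introduced here for illustration; S = s_1 ... s_n is the observed sentence and C = c_1 ... c_n a candidate correction.

```latex
% Noisy-channel decision rule: choose the correction C that best explains S.
\hat{C} = \arg\max_{C} P(C \mid S) = \arg\max_{C} P(C)\, P(S \mid C)

% Log-domain score with an assumed tunable weight \lambda on the channel term,
% and an assumed character-by-character channel model.
\operatorname{score}(C) = \log P_{\mathrm{LM}}(C) + \lambda \sum_{i=1}^{n} \log P(s_i \mid c_i)
```

Here P_{\mathrm{LM}}(C) comes from the language model, and P(s_i \mid c_i) is the probability that the intended character c_i was written as s_i, which is nonzero only for characters in the confusion set of s_i (or for s_i itself).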

Keywords

Chinese spelling correction; hidden Markov model; beam search; support vector machine; noisy channel model; language model

Parallel Abstract

First, we extracted typos from the UDN edit log and analyzed them. From these data, we created the first benchmark for evaluating Chinese spelling-check systems intended for professional editors such as journalists and writers. Second, we built a new confusion set that reduces search time. By extracting features from all pairs of Chinese characters, we trained an SVM classifier to discover potential confusion-set entries based on a table of known typos. Finally, we compared the results of HMM and beam search. Using a language model and a noisy channel model, we tuned the parameters to find the best accuracy on our benchmark. We found that beam search works much better than the HMM-based method.
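To make the beam search step concrete, here is a minimal sketch of beam-search correction that scores candidates with a bigram language model plus a character-level noisy channel term. It is an illustration under stated assumptions, not the thesis implementation: the confusion set, the bigram probabilities, `P_KEEP`, `channel_weight`, and all function names are hypothetical toy values chosen only so the example runs.

```python
import math

# Hypothetical toy confusion set: observed character -> possible intended characters.
CONFUSION = {"記": ["紀"], "紀": ["記"]}

# Hypothetical bigram probabilities standing in for a trained language model.
BIGRAM_LOGP = {
    ("<s>", "會"): math.log(0.9),
    ("會", "議"): math.log(0.9),
    ("議", "紀"): math.log(0.6),
    ("議", "記"): math.log(0.02),
    ("紀", "錄"): math.log(0.8),
    ("記", "錄"): math.log(0.8),
}
UNSEEN_LOGP = math.log(1e-4)  # assumed floor for bigrams missing from the table
P_KEEP = 0.9                  # assumed probability that a typed character is correct


def lm_logp(prev, cur):
    """Bigram log-probability of `cur` following `prev`."""
    return BIGRAM_LOGP.get((prev, cur), UNSEEN_LOGP)


def channel_logp(observed, intended):
    """Log P(observed | intended) under a uniform toy channel model."""
    if observed == intended:
        return math.log(P_KEEP)
    alternatives = CONFUSION.get(observed, [])
    if intended not in alternatives:
        return float("-inf")
    return math.log((1.0 - P_KEEP) / len(alternatives))


def beam_search_correct(sentence, beam_width=3, channel_weight=1.0):
    """Return the highest-scoring correction found by left-to-right beam search."""
    beams = [(0.0, "")]  # each entry is (log score, corrected prefix)
    for observed in sentence:
        candidates = [observed] + CONFUSION.get(observed, [])
        expanded = []
        for score, prefix in beams:
            prev = prefix[-1] if prefix else "<s>"
            for cand in candidates:
                new_score = (score + lm_logp(prev, cand)
                             + channel_weight * channel_logp(observed, cand))
                expanded.append((new_score, prefix + cand))
        # Keep only the best `beam_width` partial corrections.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]


if __name__ == "__main__":
    print(beam_search_correct("會議記錄"))
```

With a bigram model, exact Viterbi decoding over the same candidates is also possible; beam search trades exactness for the freedom to use richer sentence-level scores, and `beam_width` and `channel_weight` are the kind of parameters one would tune on a benchmark.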

Parallel Keywords

Chinese spelling check; hidden Markov model; beam search; noisy channel model; language model

