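This thesis uses the edit logs provided by United Daily News (UDN) to analyze how journalists' typos differ from those of ordinary writers. We find that most of the revisions arise from journalistic needs: converting Arabic numerals to Chinese numerals to fit the newspaper layout, making the style of some sentences more colloquial, and replacing variant characters. The last large group consists of word pairs that are very easily confused, such as "紀錄/記錄". Compared with earlier typo-correction datasets, which are biased toward younger writers or learners of Chinese, the UDN errors cover a wider range: they include not only errors based on character shape and pronunciation but also many more practical revisions. From these we selected several thousand standard sentences as the first benchmark designed specifically to test Chinese spelling check systems for professional editors. We also integrate the shape, pronunciation, and related features of characters, train a classifier with an SVM, and use this classifier to build a new confusion set; with the trained confusion set, overall search time drops considerably. In the system we introduce sentence scoring based on a Noisy Channel Model and a Language Model, and we compare HMM with Beam Search, finding that Beam Search outperforms HMM.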
First, we extracted the typos from the UDN edit log and analyzed them. From these data, we created the first benchmark for evaluating Chinese spelling check systems aimed at professional editors such as journalists and writers. Second, we built a new confusion set that reduces search time. By extracting features from all pairs of Chinese characters, we trained an SVM classifier to discover potential confusion-set entries based on a table of known typos. Last, we compared the results of HMM and beam search. Using a language model and a noisy channel model, we tuned the parameters to find the best accuracy on our benchmark. We found that beam search works much better than the HMM-based method.
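As an illustration of how a noisy channel model and a language model can be combined in a beam search over correction candidates, the following is a minimal sketch, not the system described above. The confusion set, the bigram log-probabilities, the constant `P_KEEP`, and all function names are hypothetical toy values chosen only for this example.

```python
import math

# Hypothetical confusion set: each character maps to characters it may be confused with.
CONFUSION = {"紀": ["記"], "記": ["紀"]}

# Hypothetical toy bigram log-probabilities standing in for a trained language model.
BIGRAM_LOGP = {
    ("改", "稿"): -1.0,
    ("稿", "記"): -1.5,
    ("記", "錄"): -0.5,
    ("稿", "紀"): -3.0,
    ("紀", "錄"): -2.5,
}
UNSEEN_LOGP = -8.0  # assumed back-off score for bigrams missing from the table

P_KEEP = 0.95  # assumed probability that the written character is the intended one


def channel_logp(intended, observed, n_alternatives):
    """Noisy channel term: log P(observed | intended)."""
    if intended == observed:
        return math.log(P_KEEP)
    # Spread the remaining probability mass evenly over the alternatives.
    return math.log((1.0 - P_KEEP) / n_alternatives)


def lm_logp(prev, cur):
    """Language model term: bigram log P(cur | prev)."""
    return BIGRAM_LOGP.get((prev, cur), UNSEEN_LOGP)


def beam_search_correct(sentence, beam_width=3, lm_weight=1.0):
    """Return the highest-scoring corrected sentence found by beam search."""
    beams = [(0.0, "")]  # each hypothesis is (log score, corrected prefix)
    for observed in sentence:
        alternatives = CONFUSION.get(observed, [])
        candidates = [observed] + alternatives
        expanded = []
        for score, prefix in beams:
            prev = prefix[-1] if prefix else "<s>"
            for cand in candidates:
                new_score = (
                    score
                    + lm_weight * lm_logp(prev, cand)
                    + channel_logp(cand, observed, len(alternatives))
                )
                expanded.append((new_score, prefix + cand))
        # Keep only the best `beam_width` hypotheses for the next position.
        expanded.sort(key=lambda item: item[0], reverse=True)
        beams = expanded[:beam_width]
    return beams[0][1]


if __name__ == "__main__":
    print(beam_search_correct("改稿紀錄"))
```

With these toy scores, a beam of width 3 keeps the lower-scoring hypothesis "改稿記" alive long enough for the following character to favor it, while a beam of width 1 commits to the written character immediately. The weight `lm_weight` is one example of a parameter that could be tuned against a benchmark.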