
A Study on Identifying Weak Segmentation on the Fly to Improve the Performance of Chinese Word Segmentation Systems

Identifying Weak Segmentation on the Fly to Improve Chinese Word Segmentation

Advisor: Hsin-Hsi Chen (陳信希)

Abstract


In this thesis we propose a method that improves the performance of a Chinese word segmentation system by identifying weak segmentation on the fly; the idea can be applied to a model trained with any learning algorithm. Using the first-pass output that the baseline model produces on the test data, weak-segmentation identification locates spans that are likely to be segmented incorrectly. For each error candidate span, the rich resources of the Web are then searched for plausible corrected words, which are collected into a correction-word dictionary. At test time the dictionary is used to relabel the features, so the dictionary takes effect without retraining the model, thereby improving the overall performance of the segmentation system. We apply this idea to a CRF model, propose three methods for selecting error candidate spans, and design a mechanism that uses web search result counts and Wikipedia page titles to discover new words. In experiments on the public datasets released for the SIGHAN 2005 Bakeoff, our method raises the overall F-measure above the baseline model on every dataset; the improvement is strongly and positively correlated with the out-of-vocabulary (OOV) rate, and OOV recall grows markedly on every dataset, showing that our system substantially strengthens the extraction of unknown words. On the SIGHAN 2014 and NLPCC 2015 datasets, both drawn from Weibo, adding our method to the baseline model likewise yields significant gains, indicating that it handles the emerging vocabulary of social media well.
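The abstract does not specify the three candidate-selection methods; as one plausible sketch of the "weak-segmentation identification" step, low-confidence spans of the first-pass output could be flagged as error candidates. The function name, per-token confidence scores, and threshold below are all assumptions for illustration, not the thesis's actual criteria:

```python
def find_error_candidates(tokens, confidences, threshold=0.8):
    """Flag maximal runs of low-confidence tokens as error candidate
    spans (hypothetical confidence-based criterion).

    tokens      -- segmented tokens from the baseline model
    confidences -- one model confidence score per token, in [0, 1]
    """
    candidates, run = [], []
    for tok, conf in zip(tokens, confidences):
        if conf < threshold:
            run.append(tok)          # extend the current weak span
        elif run:
            candidates.append("".join(run))  # close a weak span
            run = []
    if run:                          # span running to end of sentence
        candidates.append("".join(run))
    return candidates
```

Each returned span would then be sent to the web-based correction step to look up a plausible replacement word.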

Keywords

word segmentation; new word extraction; weakness identification

Parallel Abstract


We propose a new method that improves a Chinese word segmentation system by identifying weak segmentation on the fly; the idea can be implemented in any kind of statistical learning model. The method proceeds in three steps. First, we use the segmentation result generated by the baseline model to identify weak segmentations and denote them as error candidate tokens. Second, we use the rich resources of the Internet to correct the error candidate tokens and collect the new words we find into a correction-word dictionary. Third, we use this dictionary to relabel the testing data, rather than retraining the model, to improve overall performance. We implement this idea with a CRF and propose three ways to find the error candidate tokens; we also design a mechanism that uses web search result counts and the titles of Wikipedia pages to find new words that might have been wrongly segmented in the first place. In experiments on the SIGHAN 2005 Chinese word segmentation Bakeoff, our method gives the baseline model a better F-measure than before and also yields a significant improvement in OOV recall. On the SIGHAN 2014 and NLPCC 2015 datasets, which consist of Weibo data, the performance of the baseline model improves greatly after applying our method, showing that it is effective on social-media data containing many new words.
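The third step, relabeling the testing data with the correction-word dictionary instead of retraining the model, can be sketched as overwriting the character-level tags wherever a dictionary word matches. The BMES tag scheme, function name, and longest-match-first policy here are assumptions for illustration; the thesis's actual relabeling procedure may differ:

```python
def relabel_with_dictionary(chars, tags, correction_dict):
    """Overwrite BMES tags wherever a correction-dictionary word
    matches the character sequence (longest match first)."""
    tags = list(tags)
    words = sorted(correction_dict, key=len, reverse=True)
    i = 0
    while i < len(chars):
        for w in words:
            if "".join(chars[i:i + len(w)]) == w:
                if len(w) == 1:
                    tags[i] = "S"            # single-character word
                else:
                    tags[i] = "B"            # word-initial character
                    for j in range(i + 1, i + len(w) - 1):
                        tags[j] = "M"        # word-internal characters
                    tags[i + len(w) - 1] = "E"  # word-final character
                i += len(w) - 1              # skip past the matched word
                break
        i += 1
    return tags
```

For example, if the baseline tagged 微/S 博/S and the dictionary contains 微博, the relabeled tags become 微/B 博/E, merging the two characters into one word without touching the trained model.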

