A Hybrid Chinese Word Segmentation System for Generating Long Terms and New Terms

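This study proposes a hybrid word segmentation method. First, documents are segmented along two tracks, using a high-frequency long-term segmentation method and the CKIP segmentation method developed by Academia Sinica. Next, the parts of speech assigned by CKIP are used, through part-of-speech combinations, to verify whether the long terms produced by the high-frequency method are valid long terms. Finally, low-frequency terms that the high-frequency method cannot segment are taken from the CKIP output and added to the segmentation result. This hybrid approach mitigates CKIP's tendency to produce short terms of 2 to 3 characters, and it uses the parts of speech assigned by CKIP to reassemble the output of the high-frequency method, which reduces that method's over-reliance on frequency, so that good long terms and newly emerging terms can be found. With the Academia Sinica Balanced Corpus version 4.0 as the evaluation baseline, the experimental results show that the proposed method reaches a reasonable level of Precision, Recall, and F1-measure; when manually selected long terms are added to the Balanced Corpus, the method outperforms other publicly available segmentation methods. For new terms, segmentation experiments on Google News show that an average of 7.5 new terms can be extracted from the three news articles in each category, with an average new-term extraction rate of 80.82%, indicating that the proposed method can reliably extract new terms.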
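The three steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the lexicon, the POS-tagged input (standing in for CKIP output), the POS combination rule, and the function names `long_term_spans` and `hybrid_segment` are all assumptions made up for illustration, and plain forward maximum matching is used in place of the high-frequency long-term method, whose details the abstract does not give.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag), standing in for CKIP output


def long_term_spans(text: str, lexicon: Set[str], max_len: int = 8) -> List[Tuple[int, int]]:
    """Forward maximum matching: (start, end) spans of lexicon terms (length >= 2)."""
    spans, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i + 1, -1):
            if text[i:j] in lexicon:
                spans.append((i, j))
                i = j
                break
        else:
            i += 1  # no long term starts at this character
    return spans


def hybrid_segment(text: str, tagged: List[Token], lexicon: Set[str],
                   valid_patterns: Set[Tuple[str, ...]]) -> List[str]:
    """Keep a long term only if the POS tags it covers form an allowed combination."""
    # Character offsets of every POS-tagged token.
    offsets, pos = [], 0
    for word, tag in tagged:
        offsets.append((pos, pos + len(word), word, tag))
        pos += len(word)
    if pos != len(text):
        raise ValueError("tagged tokens do not cover the text")

    # Step 2: verify each long-term candidate against the POS combination rules.
    accepted = []
    for start, end in long_term_spans(text, lexicon):
        inside = [t for t in offsets if start <= t[0] and t[1] <= end]
        aligned = bool(inside) and inside[0][0] == start and inside[-1][1] == end
        if aligned and tuple(t[3] for t in inside) in valid_patterns:
            accepted.append((start, end))

    # Step 3: everything outside an accepted long term falls back to the tagged tokens.
    result = []
    for s, e, word, _tag in offsets:
        span = next(((a, b) for a, b in accepted if a <= s and e <= b), None)
        if span is None:
            result.append(word)
        elif s == span[0]:
            result.append(text[span[0]:span[1]])
    return result


# Toy data (hypothetical): CKIP-style tags chosen only for illustration.
text = "我喜歡自然語言處理"
tagged = [("我", "Nh"), ("喜歡", "VK"), ("自然", "Na"), ("語言", "Na"), ("處理", "VC")]
lexicon = {"自然語言處理"}
rules = {("Na", "Na", "VC")}
print(hybrid_segment(text, tagged, lexicon, rules))
```

With the rule set emptied, the candidate long term is rejected and the output falls back to the five shorter tokens, which is the behavior the verification step is meant to provide.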

Keywords

Chinese word segmentation ； new terms ； long terms ； CKIP ； high-frequency long terms ； POS combination

Parallel Abstract

This study proposes a hybrid Chinese word segmentation method. First, we segment the documents with two methods in parallel: High-Frequency Maximum Matching (HFMM) and CKIP. Second, we verify the long terms generated by HFMM using the parts of speech (POS) given by CKIP and a set of POS combination rules. Finally, we find long terms and new terms that generally would not be generated by CKIP. The experimental results on the Sinica corpus show that the proposed method achieves a reasonable level of Precision, Recall, and F1-measure. When manually selected long terms are added to the Sinica corpus, our method performs much better than other segmentation methods. In addition, the experimental results on Google News show that we obtain an average of 7.5 new terms from three news articles per category. The average accuracy rate for new terms reaches 80.82%, indicating that the proposed method can also find new terms accurately.
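For readers unfamiliar with how Precision, Recall, and F1-measure are usually computed for word segmentation, the sketch below shows one common convention: a predicted word counts as correct only if its character span exactly matches a word in the reference segmentation. The abstract does not state which scoring convention the study used, so this is an assumption, and the function names and the example sentence are hypothetical.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for w in words:
        spans.add((pos, pos + len(w)))
        pos += len(w)
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level scores: a predicted word is correct if its span is in the gold set."""
    g, p = to_spans(gold), to_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example (hypothetical): the reference keeps the long term as one word,
# while the prediction splits it into three short words.
gold = ["我", "喜歡", "自然語言處理"]
pred = ["我", "喜歡", "自然", "語言", "處理"]
print(precision_recall_f1(gold, pred))
```

In this toy case two of the five predicted words match the reference, so splitting one long term lowers both precision (2/5) and recall (2/3), which illustrates why a segmenter biased toward short words scores worse once long terms are in the reference.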

Parallel Keywords

CKIP ； Maximum Matching ； POS combination ； Chinese Word Segmentation ； New terms ； Long terms
