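This study proposes a hybrid word segmentation method. First, text is segmented along two tracks, using the high-frequency long-term segmentation method and the CKIP segmenter developed by Academia Sinica. Next, the parts of speech assigned by CKIP are used, through part-of-speech combinations, to verify whether the long terms produced by the high-frequency long-term method are valid. Finally, low-frequency words that the high-frequency long-term method cannot segment are taken from the CKIP output and merged into the segmentation result. This hybrid approach mitigates CKIP's tendency to produce short words of two to three characters. It also uses the parts of speech assigned by CKIP to reassemble the output of the high-frequency method, which reduces that method's over-reliance on frequency and helps find good long terms and newly coined words. Using the Academia Sinica Balanced Corpus 4.0 as the evaluation baseline, the experiments show that the proposed method reaches a certain level of Precision, Recall, and F1-measure; when manually selected long terms are added to the Balanced Corpus, the proposed method outperforms other publicly available segmenters. For new words, segmentation experiments on Google News show that, on average, 7.5 new words are extracted from every three news articles per category, with an average new-word extraction rate of 80.82%, indicating that the proposed method can reliably extract new words.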
This study proposes a hybrid Chinese word segmentation method. First, we segment the documents with two methods in parallel: High-Frequency Maximum Matching (HFMM) and CKIP. Second, we verify the long terms generated by HFMM using the parts of speech (POS) given by CKIP and a set of POS combination rules. Finally, the low-frequency terms that HFMM cannot segment are taken from the CKIP output and merged into the final result. In this way the method recovers long terms that CKIP generally does not generate. The experimental results on the Sinica Corpus show that the proposed method achieves a certain level of Precision, Recall, and F1-measure. Once manually selected long terms are added to the Sinica Corpus, our method performs much better than other segmentation methods. In addition, the experimental results on Google News show that we obtain an average of 7.5 new terms from every three news articles per category. The average accuracy rate for new terms reached 80.82%, indicating that the proposed method can also find new terms accurately.
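The three steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that HFMM behaves like forward maximum matching against a list of high-frequency long terms, that CKIP output is already available as (word, POS) pairs, and that the POS combination rules can be written as a set of allowed tag sequences. The names `hfmm_spans`, `hybrid_segment`, `long_terms`, and `valid_pos_patterns`, as well as the sample tokens and tags, are hypothetical and hand-written for illustration.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag), standing in for CKIP output


def hfmm_spans(text: str, long_terms: Set[str], max_len: int = 8) -> List[Tuple[int, int]]:
    """Forward maximum matching limited to a high-frequency long-term list (assumed form of HFMM)."""
    spans, i = [], 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in long_terms:
                spans.append((i, i + size))
                i += size
                break
        else:
            i += 1  # no long term starts here; this character is left to CKIP
    return spans


def hybrid_segment(text: str, ckip_tokens: List[Token], long_terms: Set[str],
                   valid_pos_patterns: Set[Tuple[str, ...]]) -> List[str]:
    """Keep an HFMM long term only if the POS sequence of the CKIP tokens it covers is allowed."""
    start_index: Dict[int, int] = {}  # character offset -> index of the token starting there
    end_index: Dict[int, int] = {}    # character offset -> index just past the token ending there
    offset = 0
    for k, (word, _) in enumerate(ckip_tokens):
        start_index[offset] = k
        offset += len(word)
        end_index[offset] = k + 1
    if offset != len(text):
        raise ValueError("CKIP tokens do not cover the input text")

    accepted: Dict[int, int] = {}  # first token index -> stop token index
    for begin, end in hfmm_spans(text, long_terms):
        # A long term must align with CKIP token boundaries to be checked.
        if begin in start_index and end in end_index:
            first, stop = start_index[begin], end_index[end]
            tags = tuple(tag for _, tag in ckip_tokens[first:stop])
            if tags in valid_pos_patterns:
                accepted[first] = stop

    result, k = [], 0
    while k < len(ckip_tokens):
        stop = accepted.get(k)
        if stop is not None:
            result.append("".join(word for word, _ in ckip_tokens[k:stop]))  # verified long term
            k = stop
        else:
            result.append(ckip_tokens[k][0])  # fall back to the CKIP word
            k += 1
    return result


# Hand-written sample data (hypothetical, not real CKIP output or real rules).
sample_text = "自然語言處理是熱門研究"
sample_ckip = [("自然", "Na"), ("語言", "Na"), ("處理", "VC"),
               ("是", "SHI"), ("熱門", "VH"), ("研究", "Na")]
sample_long_terms = {"自然語言處理", "是熱門"}  # the second is frequent but not a real term
sample_patterns = {("Na", "Na", "VC")}

segmented = hybrid_segment(sample_text, sample_ckip, sample_long_terms, sample_patterns)
print(segmented)
```

In this sample the candidate "自然語言處理" is kept because its tag sequence is in the allowed set, while the frequent string "是熱門" is rejected and its characters fall back to the CKIP words. This mirrors how POS verification is meant to offset a purely frequency-based choice.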