
運用詞彙重組方法改善中文斷詞

Using the Word Restructure Method to Improve Chinese Word Segmentation

Advisor: 洪智力

Abstract


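This study focuses on post-processing the output of the CKIP (Chinese Knowledge Information Processing) Chinese word segmentation system. It adds a word restructure module that applies simple probability to the CKIP output: it recombines the segmented units, computes a restructure probability for each combination, compares the combinations, and keeps the character sequences with the higher restructure probability as the restructured text units. Constraint-Based Mutual Information (CBMI) Chinese word segmentation is then applied to improve segmentation quality, with the goal of optimal segmentation of Chinese words of two or more characters. The CKIP system was developed by the CKIP group at Academia Sinica in Taiwan and is widely used across Chinese information processing. It reaches 99.69% accuracy on words of two characters or fewer, but its performance drops sharply as word length increases. CBMI, proposed by Hung (洪智力), extends Mutual Information (MI) based Chinese word segmentation. Its main advance is to incorporate the co-occurrence relationships between Chinese word-of-mouth (WOM) content and domain concepts into the fitness function of a Genetic Algorithm (GA), and to search for the optimal segmentation through GA training. The pipeline proposed in this study has three stages: (1) CKIP segmentation, (2) word restructure, and (3) the CBMI segmentation model. Because the aim is to improve CKIP's results on words of two or more characters, the pipeline first runs (1) CKIP. It then applies (2) the proposed word restructure model, which narrows the ambiguous regions of the CKIP output and produces high-probability restructured sequences. Finally, it uses (3) CBMI to obtain the optimized segmentation of words of two or more characters. Stages (1) and (3) can be replaced with other Chinese word segmentation methods, which shows the breadth and applicability of the framework. In addition, the study improves CBMI by following the long-word-first principle: word length is added as a weighting factor in the fitness function, so that the optimized segmentation favors words of two or more characters. Finally, domain classification of Chinese documents is used to compare the proposed segmentation method with CKIP segmentation. The experiments use datasets from three domains: movie reviews from iPeen (愛評網), food reviews from iPeen, and cosmetics reviews from Urcosme. The results show that the proposed model (word restructure combined with CKIP and CBMI) achieves the highest classification accuracy in the movie and food domains.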
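The word restructure step described above can be illustrated with a minimal sketch. The abstract does not give the exact probability formula, so the rule used here (merge two adjacent units when the merged string is more probable than the two units taken independently) is an assumption, as are the function name `restructure`, the toy frequency table, and the corpus size `total`. The input list stands in for the output of an upstream segmenter such as CKIP.

```python
from collections import Counter


def restructure(tokens, counts, total):
    """Greedily merge adjacent tokens when the merged string is more probable.

    Hypothetical rule: merge when P(ab) > P(a) * P(b), with P(s) = count(s) / total.
    """
    def prob(s):
        return counts.get(s, 0) / total

    merged = []
    for tok in tokens:
        if merged:
            candidate = merged[-1] + tok
            # Compare the joint string against the independent pair.
            if prob(candidate) > prob(merged[-1]) * prob(tok):
                merged[-1] = candidate
                continue
        merged.append(tok)
    return merged


# Toy counts, invented for illustration only.
counts = Counter({"電影": 50, "院": 30, "電影院": 20, "很": 80, "好看": 40})
total = 1000

# Pretend an upstream segmenter split "電影院" into two units.
result = restructure(["電影", "院", "很", "好看"], counts, total)
print(result)  # → ['電影院', '很', '好看']
```

The greedy left-to-right pass keeps the sketch short; the thesis compares alternative recombinations by their restructure probability, which a fuller version would enumerate.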

Parallel Abstract


In this paper, we propose a word restructure method, based on simple probability, that reorganizes Chinese sentences into high-probability combinations after an initial segmentation pass. The method restructures the segmented terms, computes a restructure probability for each combination, compares the combinations, and keeps the higher-probability combinations as word units. Constraint-Based Mutual Information (CBMI) is then applied to optimize the segmentation of Chinese words of two or more characters. CKIP achieves 99.69% accuracy on short terms, but its accuracy degrades as term length increases. Hung proposed CBMI, which adds domain concepts from word-of-mouth texts (WOMs) to the fitness function of a Genetic Algorithm (GA) built on the Mutual Information statistic. CBMI captures the relevance between Chinese words and also strengthens the influence of the connections between domain concepts and Chinese words. The segmentation process we propose consists of (1) CKIP, (2) Word Restructure, and (3) CBMI, and is designed to improve CKIP's segmentation results. Stages (1) and (3) can be swapped for other Chinese word segmentation methods, which shows the applicability of the process. In addition, we improve CBMI by adding term length as a factor in the weighting of the GA fitness function. Finally, we compare the proposed model with CKIP on domain classification of Chinese documents to determine which yields better segmentation. The experiments use three WOM datasets from different domains (movie, food, and cosmetics reviews) crawled from the internet. The results show that the proposed segmentation model achieves the best accuracy on the movie and food datasets.
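The idea of weighting a mutual-information score by term length can be sketched as follows. This is not the CBMI fitness function, which the abstract does not specify. It is a hypothetical scoring rule that assumes a word's cohesion is the weakest pointwise mutual information over its internal split points, and that each word's cohesion is multiplied by its length. The names `pmi` and `fitness` and the toy counts are illustrative, and the GA search itself is omitted; only two candidate segmentations are compared.

```python
import math


def pmi(left, right, counts, total):
    """Pointwise mutual information (log base 2) of two adjacent strings."""
    joint = counts.get(left + right, 0)
    if joint == 0 or counts.get(left, 0) == 0 or counts.get(right, 0) == 0:
        return float("-inf")  # unseen strings give no evidence of cohesion
    p_joint = joint / total
    return math.log2(p_joint / ((counts[left] / total) * (counts[right] / total)))


def fitness(words, counts, total):
    """Hypothetical length-weighted score of one candidate segmentation."""
    score = 0.0
    for w in words:
        if len(w) < 2:
            continue  # single characters contribute nothing
        # Cohesion: the weakest PMI over all internal split points of the word.
        cohesion = min(pmi(w[:i], w[i:], counts, total) for i in range(1, len(w)))
        score += len(w) * cohesion  # longer cohesive words are rewarded more
    return score


# Toy counts, invented for illustration only.
counts = {"電": 60, "影": 55, "院": 30, "電影": 50, "影院": 25, "電影院": 20}
total = 1000

long_score = fitness(["電影院"], counts, total)
split_score = fitness(["電影", "院"], counts, total)
print(long_score > split_score)  # → True
```

With these counts the three-character word outscores the split version, which is the behavior a long-word-first weighting is meant to encourage. A GA would use such a score to rank many candidate segmentations rather than two.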


Cited By


黃政華 (2017). 發展適應性中文相似詞庫於口碑分類 [Master's thesis, Chung Yuan Christian University]. Airiti Library. https://doi.org/10.6840/cycu201700784
