中文斷詞方法之研究與實作

Study and Implementation on Chinese Word Segmentation Methods

Advisor: 林宣華

Abstract


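This thesis presents a general text-analysis pipeline and word segmentation method, and uses them to build a Chinese word segmentation system. Domain keywords found in the text are extracted and compiled into a domain keyword dictionary to improve segmentation accuracy. First, the pipeline splits the text, according to its structure, into sentences that each carry a complete meaning. Second, Chinese word segmentation is performed by dictionary lookup. Third, based on the output of the dictionary-based segmentation, four text-analysis methods are proposed; where consecutive individual words appear in a segmented sentence, they are combined into strings to produce semantically meaningful domain keywords. The first method, Rule, generates meaningful keywords from the surrounding context. The second, homophone error tolerance, corrects miswritten characters in the text and restores the original keywords. The third, Pattern, extracts common formats such as numbers, times, and URLs. The fourth, NER, recognizes the names of people, places, and organizations. Finally, a domain keyword dictionary is generated to improve segmentation performance. To verify segmentation performance, an experiment in four stages was designed. The results show a significant improvement: F1 rose to 93% and Precision rose to 93%.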
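The dictionary-lookup step and the combination of consecutive words can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the abstract does not name a matching strategy, so forward maximum matching is assumed here, and reading "consecutive individual words" as runs of single-character tokens is also an assumption. The function names `segment` and `candidate_keywords` and the toy dictionary are hypothetical.

```python
def segment(text, dictionary):
    """Forward maximum matching (assumed strategy): at each position take the
    longest dictionary word; fall back to a single character when none matches."""
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidate_keywords(tokens):
    """Collect runs of two or more consecutive single-character tokens.
    Each run is only a candidate for the domain keyword dictionary."""
    candidates = []
    run = ""
    for token in tokens:
        if len(token) == 1:
            run += token
        else:
            if len(run) >= 2:
                candidates.append(run)
            run = ""
    if len(run) >= 2:
        candidates.append(run)
    return candidates


toy_dictionary = {"中文", "斷詞", "系統"}  # hypothetical entries
tokens = segment("開發中文斷詞系統", toy_dictionary)
print(tokens)                      # ['開', '發', '中文', '斷詞', '系統']
print(candidate_keywords(tokens))  # ['開發']
```

In this sketch the unknown word 開發 falls out of the lookup as two single characters and is recovered as one candidate. In the thesis, deciding which candidates become domain keywords is the job of the Rule, homophone, Pattern, and NER methods, which the sketch does not attempt.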

Parallel Abstract


This thesis presents a general text-analysis process and word segmentation method for developing a Chinese word segmentation system. Domain keywords are extracted from the text and organized into a domain keyword dictionary to improve segmentation accuracy. First, the system decomposes the text, according to its structure, into sentences with complete meaning. Second, it performs Chinese word segmentation by dictionary lookup. Third, based on the results of the dictionary-based segmentation, four text-analysis methods are proposed: where consecutive words appear in a segmented sentence, they are combined into strings to generate semantically meaningful domain keywords. The first method, Rule, generates meaningful keywords based on the context. The second, homophone error tolerance, corrects typos in the text and restores the original keywords. The third, Pattern, captures common formats such as numbers, times, and URLs. The fourth, NER, recognizes the names of people, places, and organizations. Finally, a domain keyword dictionary is generated to improve segmentation performance. To verify the effectiveness of the segmentation, a four-stage experiment was designed. The experimental results show that segmentation performance improved significantly: the F1 value rose to 93% and the Precision value rose to 93%.
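The Precision and F1 figures quoted above are word-level metrics. The abstract does not state how they were computed, so the following is only a sketch of one common convention, assumed here: a predicted word counts as correct when its character span exactly matches a span in the reference segmentation. The names `to_spans` and `precision_recall_f1` and the example sentences are hypothetical.

```python
def to_spans(tokens):
    """Convert a token list into a set of (start, end) character offsets."""
    spans = set()
    start = 0
    for token in tokens:
        spans.add((start, start + len(token)))
        start += len(token)
    return spans


def precision_recall_f1(reference, predicted):
    """Word-level precision, recall, and F1 by exact span match."""
    ref_spans = to_spans(reference)
    pred_spans = to_spans(predicted)
    correct = len(ref_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(ref_spans) if ref_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = ["開發", "中文", "斷詞"]      # hypothetical gold segmentation
predicted = ["開", "發", "中文", "斷詞"]  # hypothetical system output
p, r, f1 = precision_recall_f1(reference, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.500 recall=0.667 f1=0.571
```

Two of the four predicted words match the reference, so precision is 2/4; two of the three reference words are found, so recall is 2/3; and F1 is their harmonic mean, 4/7.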

