
以遺傳演算法為基礎 結合交互資訊之自動化中文斷詞系統

An Automatic Chinese Word Segmentation System Based on Integration of Genetic Algorithm and Mutual Information

Advisor: 洪智力

Abstract


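With the rapid development of information technology, Chinese information processing by computer has moved from the question of how to display Chinese characters to the question of how to make computers understand the content of a text. Every area of Chinese information processing, such as Chinese speech recognition, Chinese information retrieval, machine translation, natural language understanding, and Chinese text mining, must first perform Chinese word segmentation, which splits sentences or documents into smaller lexical units that a machine can then interpret and process. Current Chinese word segmentation systems still depend on the lexical knowledge in a back-end lexicon or corpus, and building a lexicon requires a great deal of manual labor. When new or unknown words appear and lexicon updates cannot keep pace, segmentation results are directly affected. To address these problems, this study proposes a dynamically adaptive Chinese word segmentation system that collects Chinese text from the Internet to build and update its lexicon. For the segmentation itself, a genetic algorithm searches for the best combination of word boundaries, using a fitness function based on the concept of mutual information to better capture the relationship between adjacent words. Finally, on a Chinese document classification task, the proposed segmenter is compared with a genetic-algorithm-based segmenter proposed by earlier researchers and with the well-known segmenter CKIP. The results show that when the documents to be classified are themselves used as the source for building the lexicon, documents segmented with the proposed segmenter achieve clearly higher classification accuracy than those segmented with the other two. Among segmenters that use a genetic algorithm to compute the segmentation, the proposed fitness function incorporating mutual information also outperforms the earlier fitness function based on word length and word frequency under the same comparison baseline. These results suggest that if the proposed lexicon-building method can collect vocabulary more broadly, it could improve the results of existing Chinese word segmenters.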

Parallel Abstract


With the development of information technology, Chinese information processing has shifted from merely displaying Chinese characters on a computer screen to understanding meaning and context. All areas of Chinese information processing, including Chinese speech recognition, Chinese information retrieval, Chinese document classification, Chinese machine translation, natural language understanding, and Chinese text mining, depend on a first step of Chinese word segmentation, which splits whole documents or sentences into meaningful words that a machine can handle. Current Chinese word segmentation systems still rely on the knowledge in a back-end lexicon or corpus, and building a lexicon requires a great deal of manual work. If a new or unknown word, such as a person name, place name, or event name, appears and the lexicon is not updated promptly, the segmentation result suffers. To address this problem, we propose a Chinese word segmentation system with a dynamically adaptive lexicon: it collects Chinese documents from the Internet and uses them to build and update the lexicon automatically. To compute the segmentation, we use a genetic algorithm whose fitness function incorporates the concept of mutual information, borrowed from statistics, to better capture the relationships between words. Finally, because there is no absolute criterion for judging whether a segmentation is good, we use Chinese document classification to evaluate our results against another GA-based Chinese word segmentation system proposed by Chen et al. (2000) and against CKIP, a well-known modern Chinese word segmentation system developed by Academia Sinica.
The results show that if the document set to be classified is used as training data for lexicon building before classification, the proposed segmenter achieves higher classification accuracy than the other two segmenters. Among GA-based segmenters, our fitness function incorporating mutual information also outperforms the fitness function of Chen et al., which uses word length and frequency. These results suggest that if words can be collected on a large scale with the proposed lexicon-building approach, the results of current Chinese word segmentation systems could be greatly improved.
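
The abstract does not spell out the chromosome encoding or the exact fitness function, so the following is only a minimal sketch of the general idea, under stated assumptions: a segmentation is encoded as one bit per gap between adjacent characters (1 means a word boundary), and the fitness rewards keeping a character pair together when its pointwise mutual information exceeds a threshold. The names `build_counts`, `pmi`, `decode`, `segment`, and the `threshold` parameter are hypothetical and are not taken from the thesis.

```python
import math
import random
from collections import Counter

def build_counts(corpus):
    """Count single characters and adjacent character pairs in a list of sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    return unigrams, bigrams

def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information log2(p(ab) / (p(a) * p(b))).

    A crude +1 is added to every count so unseen pairs do not yield log(0).
    """
    n_uni = sum(unigrams.values()) + 1
    n_bi = sum(bigrams.values()) + 1
    p_ab = (bigrams[a + b] + 1) / n_bi
    p_a = (unigrams[a] + 1) / n_uni
    p_b = (unigrams[b] + 1) / n_uni
    return math.log2(p_ab / (p_a * p_b))

def decode(sentence, cuts):
    """Turn a bit list (1 = boundary after character i) into a list of words."""
    words, start = [], 0
    for i, cut in enumerate(cuts):
        if cut:
            words.append(sentence[start:i + 1])
            start = i + 1
    words.append(sentence[start:])
    return words

def segment(sentence, unigrams, bigrams, threshold=0.0,
            pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Search for a segmentation with a small genetic algorithm (assumed design)."""
    rng = random.Random(seed)
    n = len(sentence) - 1
    if n <= 0:
        return [sentence] if sentence else []
    # Association strength of each gap, relative to the assumed threshold.
    scores = [pmi(sentence[i], sentence[i + 1], unigrams, bigrams) - threshold
              for i in range(n)]

    def fitness(cuts):
        # Joined gaps earn their score; cut gaps earn the negated score.
        return sum(-s if c else s for s, c in zip(scores, cuts))

    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        nxt = pop[:2]  # elitism: carry the two best individuals over unchanged
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            point = rng.randint(1, n - 1) if n > 1 else 0  # one-point crossover
            child = p1[:point] + p2[point:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit-flip mutation
            nxt.append(child)
        pop = nxt
    return decode(sentence, max(pop, key=fitness))

if __name__ == "__main__":
    corpus = ["中文斷詞", "中文資訊", "斷詞系統", "資訊系統"]
    uni, bi = build_counts(corpus)
    print(segment("中文斷詞系統", uni, bi, threshold=2.5))
```

Because this toy fitness scores each gap independently, the optimum could be found without a genetic algorithm at all; the search only becomes worthwhile when the fitness also includes word-level terms such as lexicon membership, which the sketch leaves out.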

References


(1) 陳稼興, 謝佳倫, & 許芳誠. (2000). 以遺傳演算法為基礎的中文斷詞研究 [A study of Chinese word segmentation based on genetic algorithms]. 資訊管理研究期刊, 8-24.
(2) Foo, S., & Li, H. (2004). Chinese Word Segmentation and its effect on information retrieval. Information Processing and Management, 40(1), 161-190.
(3) Fung, P., & Wu, D. (1999). Statistical augmentation of a Chinese machine-readable dictionary. Natural Language Processing Using Very Large Corpora, 137.
(4) Galil, Z. (1986). Efficient algorithms for finding maximum matching in graphs. ACM Computing Surveys (CSUR), 18(1), 23-38.
(5) Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Boston: Addison-Wesley Longman Publishing Co., Inc.
