

Document Clustering based on Approximate Word Pattern Matching and Correlation of Co-occurrence

Advisor: 楊燕珠


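Motivated by the way ordinary users search the Internet for material to read and browse, this study performs thematic document clustering with the goal of sorting large volumes of material into categories quickly and accurately, so that reading turns it into genuinely useful information. Approximate Word Pattern Matching is used for feature extraction, and the distance information of word patterns is incorporated into the measurement of the correlation of co-occurrence. Extending the tf-idf concept of the Vector Space Model from Information Retrieval, we build a vector space model from the co-occurrence correlation of approximate word patterns and idf (pwf-idf or pa-idf), derive the relatedness between documents from it, and propose a simple and effective clustering method that recursively merges highly similar data in order to cluster target documents with similar topics. Experimental analysis shows that our method gives better results than that of Yang & Yu, which "uses adjacent two-word word bigrams as features, clusters the words first, derives the related documents from the word clusters, and then merges clusters whose documents overlap heavily to form the final document clusters." This demonstrates that approximate word pattern matching extracts more features shared across documents, and that the proposed document clustering model resolves the error propagation caused by multiple clustering.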




Because users often search the Internet for related texts to read or browse, this research aims to group large numbers of texts quickly and accurately through thematic document clustering, so that users can absorb them efficiently while reading and turn them into genuinely useful information. The research comprises feature extraction, feature strength measurement, document-feature vector space modeling, and clustering analysis on the document-document space. Feature extraction is based on approximate word pattern matching. The strength of each feature is measured by a correlation of co-occurrence that takes the approximation tolerance into account, that is, the distance between the components of the pattern. We then extend the tf-idf concept of the vector space model from Information Retrieval to establish a document-feature vector space model built from the correlation of co-occurrence and idf (pwf-idf or pa-idf). To cluster effectively, the document-document vector space is generated from the similarity between all pairs of documents, and that similarity is calculated from the document-feature vector space model. Finally, a simple and effective clustering method that recursively merges highly similar data is presented. Experimental analysis shows that our method gives better results than that of Yang & Yu, which uses word bigrams as features, clusters the words first, groups the documents containing them into concept clusters, and then merges concept clusters with high document overlap to form the final document clusters. This research verifies that approximate word pattern matching extracts more common features from documents, and that the proposed document clustering model resolves the error propagation that results from multiple clustering.
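The abstract does not give the definitions of pwf-idf or pa-idf, so the following Python sketch is only an illustration of the general pipeline it describes, not the thesis's actual formulas. It assumes documents are already segmented into word lists, treats a "pattern" as an ordered word pair found within a small gap, weights each occurrence by 1/distance, uses log(N/df) as the idf term, and merges clusters by average cosine similarity. The function names, `max_gap`, and `threshold` are all hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations


def pattern_weights(tokens, max_gap=2):
    """Sum 1/distance over every occurrence of each ordered word pair.

    Assumption: pairs up to max_gap + 1 positions apart count as approximate
    matches; adjacent words (distance 1) get full weight.
    """
    weights = defaultdict(float)
    for i, first in enumerate(tokens):
        for dist in range(1, max_gap + 2):
            if i + dist < len(tokens):
                weights[(first, tokens[i + dist])] += 1.0 / dist
    return dict(weights)


def weighted_idf_vectors(docs, max_gap=2):
    """Build one sparse vector per document: pattern weight times log(N / df)."""
    per_doc = [pattern_weights(tokens, max_gap) for tokens in docs]
    doc_freq = defaultdict(int)
    for weights in per_doc:
        for pattern in weights:
            doc_freq[pattern] += 1
    n = len(docs)
    return [
        {p: w * math.log(n / doc_freq[p]) for p, w in weights.items()}
        for weights in per_doc
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v[p] for p, x in u.items() if p in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_clusters(vectors, threshold=0.05):
    """Repeatedly merge the most similar pair of clusters (average linkage)
    until the best remaining pair falls below the threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def link(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        score, x, y = max(
            (link(clusters[x], clusters[y]), x, y)
            for x, y in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]  # x < y, so deleting y is safe
        del clusters[y]
    return clusters


# Toy, pre-segmented documents (illustrative data only).
docs = [
    "stock market prices fell sharply".split(),
    "stock prices fell on the market".split(),
    "team wins the football final".split(),
    "football team wins cup final".split(),
]
clusters = merge_clusters(weighted_idf_vectors(docs), threshold=0.05)
print(clusters)
```

In this toy run the pair ("stock", "prices") is adjacent in the second document but separated by one word in the first, so a strict bigram feature would miss the match in the first document, while the gap-tolerant pattern still links the two at reduced weight.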


document clustering


[18] 林頌堅, "基於術語抽取與術語叢集技術的主題抽取," 中文計算語言學, vol. 9, pp. 97-112, 2004.
[19] 許長謨, "從近三年報刊標題看語詞的豐富多變--兼論詞彙學的重要," 成大中文學報, vol. 11, pp. 167-200, 2003.
[1] Yen-Ju Yang, Su-Hsin Yu, "Chinese Text Clustering for Topic Detection Based on Word Pattern Relation," in AI-2006: The Twenty-sixth SGAI International Conference on Artificial Intelligence, pp. 408-412, Dec. 2006.
[5] K. Fragos, Y. Maistros, C. Skourlas, "Discovering Collocations in Modern Greek Language," in Proceedings of 1st International Conference on Natural Language Understanding and Cognitive Science. Porto, Portugal, 2004, pp. 151-158.
[6] J. S. Justeson, S. M. Katz, "Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text," Natural Language Engineering, vol. 1, pp. 9-27, 1995.


