  • Degree thesis

以字詞類別概念輔助部落格文件分群之研究

An Effective Approach for Weblog Documents Clustering based on Categorical Concepts of Words

Advisor: 柯佳伶

Abstract


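This thesis uses the ODP (Open Directory Project) directory structure as an external knowledge source. Through the ODP query function, we obtain the categories to which each word belongs and use them as features; the categories of all words in a post, together with their weights, are combined to build the post's feature vector. The aim is to overcome the shortcomings of feature vectors built purely from extracted keywords and thereby achieve better topic-based clustering of posts. In addition, the topical concentration of posts differs from blog to blog, and a common problem when clustering with the K-Means algorithm is not knowing how to set an appropriate number of clusters K. This thesis therefore also proposes a method that automatically determines the number of clusters and the initial representative points for K-Means from the feature vectors of the posts in a collection, making blog post clustering more automatic. We apply the category feature vector method and the term feature vector method separately in clustering experiments and evaluate the results with Accuracy and Purity. The evaluation shows that, for most blogs in the test set, the category feature vector method yields better clustering results than the term feature vector method. The experiments further show that combining the category feature vectors of a post's title words and compound words improves clustering further.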
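To make the idea of a category feature vector concrete, here is a minimal sketch in Python. It is not the thesis's implementation: the `WORD_CATEGORIES` table is a hypothetical stand-in for an ODP query, the category names are made up, and using raw term frequency as the word weight is an assumption, since the abstract does not state the weighting scheme.

```python
from collections import Counter, defaultdict

# Hypothetical word -> categories table standing in for an ODP query.
# The category paths are illustrative only.
WORD_CATEGORIES = {
    "guitar": ["Arts/Music"],
    "concert": ["Arts/Music", "Arts/Performing_Arts"],
    "recipe": ["Home/Cooking"],
}


def category_vector(words, lookup):
    """Add each word's weight to every category the word belongs to.

    The weight is the word's raw frequency in the post (an assumption).
    Words with no known category contribute nothing.
    """
    weights = Counter(words)
    vector = defaultdict(float)
    for word, weight in weights.items():
        for category in lookup.get(word, []):
            vector[category] += weight
    return dict(vector)


post = ["guitar", "concert", "guitar", "recipe", "unknown"]
vec = category_vector(post, WORD_CATEGORIES)
print(vec)
```

Two posts that share no keywords can still end up with similar vectors when their words fall under the same categories, which is the advantage the abstract claims over purely keyword-based vectors.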

Parallel Abstract


Our approach uses the directory structure of ODP (Open Directory Project) as external knowledge. Through the ODP query function, we obtain the categories to which a query word belongs and use those categories as the word's features. To build the category feature vector of a post, we merge the categories of all words in the post together with the corresponding word weights. We aim to overcome the drawbacks of building feature vectors from keyword frequency alone and thereby achieve better topic-based clustering results. We also propose a method to help decide the value of K in the K-means algorithm; it takes the category relations among the posts of a blog into consideration, which makes clustering more automatic. We compare the clustering results of our approach with those of term-based feature vectors using the Purity and Accuracy measures. The experiments show that our approach outperforms the term-based feature vector approach on most blogs in the test set. We also use the titles and phrases of posts as additional feature vectors and show that these two features assist clustering effectively.
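As an illustration of the Purity measure mentioned above, the sketch below uses the standard definition: each cluster is credited with the size of its most frequent true topic, and the total is divided by the number of posts. The sample labels are invented for the example, and the thesis may define or normalize the measure differently.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Standard purity: sum of per-cluster majority counts over total items."""
    if len(cluster_labels) != len(true_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    # Group the true topics of the posts by the cluster they were assigned to.
    members_by_cluster = {}
    for cluster, topic in zip(cluster_labels, true_labels):
        members_by_cluster.setdefault(cluster, []).append(topic)
    # For each cluster, count how many posts carry its most common topic.
    majority_total = sum(
        Counter(topics).most_common(1)[0][1]
        for topics in members_by_cluster.values()
    )
    return majority_total / len(true_labels)


# Invented example: six posts, two clusters, three true topics.
clusters = [0, 0, 0, 1, 1, 1]
topics = ["music", "music", "food", "food", "food", "travel"]
score = purity(clusters, topics)
print(round(score, 3))
```

Purity rises trivially as the number of clusters grows, so it is best read alongside a second measure, as the abstract does with Accuracy.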
