A Study on Blog Document Clustering Assisted by Word Category Concepts

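This thesis uses the ODP (Open Directory Project) directory structure as an external knowledge source. Through ODP's query function, we obtain the categories to which each word belongs and use them as features; combining the categories of all words in a post with their weights yields the post's feature vector. The aim is to overcome the shortcomings of feature vectors built purely from extracted keywords and thereby achieve better topic-based clustering of posts. In addition, the topical concentration of posts differs from blog to blog, and a common problem when clustering with the K-Means algorithm is not knowing how to set an appropriate number of clusters K. This thesis therefore also proposes determining the number of clusters and the initial representative points for K-Means automatically from the feature vectors of the posts in the collection, which makes blog post clustering more automatic. We apply the category feature vector method and the term feature vector method separately in post clustering experiments and evaluate the results by Accuracy and Purity. The evaluation shows that, for most blogs in the test set, the category feature vector method yields better clustering results than the term feature vector method. The experiments further show that combining the category feature vectors of a post's title words and compound words improves clustering further.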
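The construction of a category feature vector described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the `WORD_CATEGORIES` table is a hypothetical stand-in for results returned by ODP's query function, the category names and word weights are invented for the example, and adding a word's full weight to each of its categories is an assumption, since the abstract does not specify how weights are distributed.

```python
import math
from collections import defaultdict

# Hypothetical word -> category lookup (assumption). In the thesis this
# information comes from ODP's query function, which is not reproduced here.
WORD_CATEGORIES = {
    "guitar": ["Arts/Music"],
    "concert": ["Arts/Music", "Arts/Performing_Arts"],
    "compiler": ["Computers/Programming"],
}


def category_vector(word_weights, word_categories=WORD_CATEGORIES):
    """Add each word's weight to every category the word belongs to.

    Assumption: the full weight goes to each category; words without a
    known category are ignored.
    """
    vec = defaultdict(float)
    for word, weight in word_weights.items():
        for category in word_categories.get(word, []):
            vec[category] += weight
    return dict(vec)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(key, 0.0) for key, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Two posts with no words in common can still be similar at category level.
post_a = category_vector({"guitar": 2.0})
post_b = category_vector({"concert": 1.0})
post_c = category_vector({"compiler": 1.0})
```

Here `post_a` and `post_b` share no terms, so a term-based vector would give them zero similarity, yet both map to the music category and their category vectors overlap. `post_c` stays orthogonal to both.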

Keywords

Data mining; blog post clustering; category feature vector

Parallel Abstract

Our approach uses the ODP (Open Directory Project) directory structure as external knowledge. Through the query function of ODP, we obtain the categories of each query word and use those categories as word features. To build the category feature vector of a post, we merge the categories of all words in the post together with the corresponding word weights. We aim to overcome the drawback of building feature vectors from keyword frequency and to achieve better topic-based clustering results. We also propose a method that assists in choosing the value of K for the K-means algorithm: it takes the category relations among the posts of a blog into consideration, which makes clustering more automatic. We compare the clustering results of our approach with those of term-based feature vectors using the Purity and Accuracy measures. The experiments show that our approach gives better results than the term-based feature vector approach for most blogs in the test set. We also combine the title words and phrases of a post as additional feature vectors, and the experiments show that these two features further improve clustering.
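As a reminder of what the Purity measure captures, the sketch below uses the common textbook definition: each cluster is credited with its most frequent true topic, and the credited posts are divided by the total number of posts. The abstract does not state the exact formula used in the thesis, so this definition, the function name, and the sample labels are all assumptions for illustration.

```python
from collections import Counter


def purity(cluster_labels, true_labels):
    """Fraction of posts that carry the majority topic of their cluster."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    if not cluster_labels:
        raise ValueError("label lists must not be empty")

    # Count the true topics that occur inside each cluster.
    topics_by_cluster = {}
    for cluster, topic in zip(cluster_labels, true_labels):
        topics_by_cluster.setdefault(cluster, Counter())[topic] += 1

    # Each cluster contributes the size of its largest topic group.
    majority_total = sum(max(c.values()) for c in topics_by_cluster.values())
    return majority_total / len(true_labels)


# Six posts, two clusters, two hand-assigned topics (invented sample data).
clusters = [0, 0, 0, 1, 1, 1]
topics = ["music", "music", "tech", "tech", "tech", "music"]
score = purity(clusters, topics)  # 4 of the 6 posts match their cluster's majority topic
```

Purity rises trivially as the number of clusters grows, since one post per cluster gives a score of 1.0, which is one reason to report it alongside a second measure such as Accuracy.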

Parallel Keywords

Data Mining; Blog Post Clustering; Category Feature Vector
