Document Clustering Based on Frequent Itemset Integrated with Approximate Pattern Matching

Abstract

With the spread of the Internet, more and more users search online for material to read. This research aims to cluster large document collections by topic so that users can quickly see which topics a collection covers and pick the documents on the topics they need. To extract more meaningful features, we integrate the frequent itemsets of association rule mining with approximate pattern matching to mine "Approximate Frequent Patterns" as document features. The distance (similarity) of the approximate match is incorporated into the feature weight, in contrast to the traditional support count (frequency) of itemsets. We also propose a "Two-Phase Density and Similarity-Based Clustering Algorithm", which does not require the number of clusters to be set in advance and is therefore suited to clustering large numbers of documents by topic. Experimental results show that the number of "Approximate Frequent Patterns" is 1.42 times the number of flexible word pairs and 0.84 times the number of single terms. Clustering with these features yields higher average recall, precision, and accuracy than clustering with flexible word pairs, adjacent word pairs (bigrams), or single terms, which shows that "Approximate Frequent Patterns" capture more meaningful and discriminative features. Combined with the proposed clustering algorithm, the approach speeds up clustering, makes it easier to determine an appropriate number of clusters, and improves the quality and accuracy of document clustering.
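The idea of weighting a pattern by match similarity rather than by a plain support count can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the matching threshold, or how per-term similarities are combined, so everything below is an assumption made for illustration: `difflib.SequenceMatcher` stands in for the similarity measure, a pattern is treated as a set of terms, the weakest term match bounds a document's contribution, and the names `similarity`, `best_match`, and `pattern_weight` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; an assumed stand-in for the paper's distance."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(term: str, doc_terms: list[str], threshold: float) -> float:
    """Best similarity of `term` to any term in the document, or 0.0 below threshold."""
    best = max((similarity(term, t) for t in doc_terms), default=0.0)
    return best if best >= threshold else 0.0


def pattern_weight(pattern: tuple[str, ...], docs: list[list[str]],
                   threshold: float = 0.8) -> float:
    """Similarity-weighted support of a pattern over a document collection.

    Each document contributes the similarity of its weakest term match
    (assumption), so an exact occurrence counts 1.0, a near match counts
    less than 1.0, and a document missing any term counts 0.0.
    """
    total = 0.0
    for doc in docs:
        total += min(best_match(term, doc, threshold) for term in pattern)
    return total / len(docs)


docs = [
    ["cluster", "document"],
    ["clusters", "document"],   # near match for "cluster"
    ["network", "router"],
]
approx = pattern_weight(("cluster", "document"), docs, threshold=0.8)
exact = pattern_weight(("cluster", "document"), docs, threshold=1.0)
print(f"approximate weight: {approx:.3f}, exact support: {exact:.3f}")
```

With the threshold set to 1.0 the function reduces to the ordinary relative support count, so the gap between the two printed values is the contribution of near matches that exact matching would discard.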

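Cited by

陳寶燦 (2010). 應用分群技術於同義書目之過濾與最佳化 [Applying clustering techniques to the filtering and optimization of synonymous bibliographic records] (Master's thesis, Tatung University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0081-3001201315105100
林莉雯 (2011). 整合二進制粒子群最佳化與遺傳演算法之特徵選擇於文件分類 [Feature selection integrating binary particle swarm optimization and genetic algorithms for document classification] (Master's thesis, Tatung University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0081-3001201315112268
林宛儒 (2012). 混合式粒子群最佳化與遺傳演算法於動態文件分群 [Hybrid particle swarm optimization and genetic algorithm for dynamic document clustering] (Master's thesis, Tatung University). Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0081-3001201315112525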
