Document Clustering Based on Frequent Itemsets Combined with Approximate Pattern Matching

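With the spread of the Internet, more and more users search online for material to read. This study aims to cluster large document collections by topic, so that users can quickly see which topics a collection covers and pick the documents on the topics they need. We combine the frequent itemsets of association-rule mining with approximate pattern matching to mine "approximate frequent patterns" as document features, and we incorporate the distance (similarity) of the approximate match into the feature weight. In addition, we propose a "two-phase density- and similarity-based clustering algorithm"; it does not require the number of clusters to be set in advance and is suitable for clustering large numbers of documents. Experimental results show that the number of "approximate frequent pattern" features is 1.42 times the number of flexible word pairs and 0.84 times the number of single terms. Clustering with these features yields higher average recall, precision, and accuracy than clustering with flexible word pairs, adjacent word pairs, or single terms, which shows that "approximate frequent patterns" do extract more meaningful and discriminative features. Combined with the proposed clustering algorithm, the approach speeds up clustering, makes it easier to determine a suitable number of clusters, and improves the quality and accuracy of document clustering.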
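The abstract does not give the matching rule or the weighting formula, so the following is only a minimal sketch of how a match distance could be folded into the support of a pattern. Everything in it is an assumption made for illustration, not the authors' method: patterns are ordered term pairs, a match may skip up to `max_gap` tokens, similarity is taken as 1 / (1 + skipped tokens), and the names `match_similarity`, `approximate_frequent_patterns`, `max_gap`, and `min_support` are hypothetical.

```python
from collections import defaultdict


def match_similarity(pattern, tokens, max_gap=2):
    """Best similarity of an ordered pattern against a token list, or 0.0 if no match.

    Assumption: similarity = 1 / (1 + number of tokens skipped between
    consecutive pattern terms), using the nearest occurrence of each next term.
    """
    best = 0.0
    for start, tok in enumerate(tokens):
        if tok != pattern[0]:
            continue
        pos, gaps, matched = start, 0, True
        for term in pattern[1:]:
            # The next term must appear after at most max_gap skipped tokens.
            window = tokens[pos + 1 : pos + 2 + max_gap]
            if term not in window:
                matched = False
                break
            offset = window.index(term)
            gaps += offset
            pos += 1 + offset
        if matched:
            best = max(best, 1.0 / (1.0 + gaps))
    return best


def approximate_frequent_patterns(docs, min_support=1.5, max_gap=2):
    """Return {pattern: weighted support} for ordered term pairs.

    An exact adjacent match adds 1.0 to the support; a looser match adds less.
    """
    candidates = set()
    for tokens in docs:
        for i, a in enumerate(tokens):
            for b in tokens[i + 1 : i + 2 + max_gap]:
                if a != b:
                    candidates.add((a, b))
    support = defaultdict(float)
    for pattern in candidates:
        for tokens in docs:
            support[pattern] += match_similarity(pattern, tokens, max_gap)
    return {p: s for p, s in support.items() if s >= min_support}


docs = [
    ["frequent", "itemset", "mining"],
    ["frequent", "closed", "itemset"],
    ["document", "clustering"],
]
patterns = approximate_frequent_patterns(docs)
```

In this toy input the pair ("frequent", "itemset") matches the first document exactly and the second with one skipped token, so its weighted support exceeds that of pairs seen in only one document. A plain frequency count would not distinguish the exact match from the looser one.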

Keywords

Frequent Itemset; Pattern Matching; Feature Extraction; Document Clustering

Parallel Abstract

Due to the popularization of the Internet, more and more users find reading material by searching the Internet directly. This research aims to group a large number of texts by thematic document clustering, so that users can quickly see how many topics the texts cover and pick the topics of interest to read. To extract more meaningful features, we propose an approach that integrates frequent itemsets with approximate pattern matching to mine "Approximate Frequent Patterns". The distance (similarity) of the approximate match is used in measuring feature weights, unlike the traditional support count (frequency) of itemsets. In addition, we present the "Two-Phase Density and Similarity-Based Clustering Algorithm". This method does not require the number of clusters to be set in advance, which makes it suitable for thematic document clustering. The experimental results show that the number of "Approximate Frequent Patterns" is 1.42 times that of flexible word pairs and 0.84 times that of single terms. With these features, the average recall, precision, and accuracy of the clustering results are all higher than with flexible word pairs, bigrams, or single words. This shows that "Approximate Frequent Patterns" do extract more meaningful and discriminative features. Moreover, the proposed clustering algorithm speeds up clustering, makes it easy to decide on an appropriate number of clusters, and improves the quality and accuracy of document clustering.
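The abstract names the clustering algorithm but does not describe its steps. The sketch below is therefore a generic illustration of how a density step followed by a similarity step can produce clusters without a preset cluster count; it should not be read as the authors' algorithm. The function `two_phase_cluster`, the parameters `sim_threshold` and `min_neighbors`, the use of cosine similarity on sparse term-weight dictionaries, and the label -1 for unassigned documents are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    """Mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for t, w in vec.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def two_phase_cluster(vectors, sim_threshold=0.75, min_neighbors=2):
    n = len(vectors)
    neighbors = [
        [j for j in range(n)
         if j != i and cosine(vectors[i], vectors[j]) >= sim_threshold]
        for i in range(n)
    ]
    # Phase 1 (density): documents with enough close neighbours are cores;
    # cores reachable from one another through neighbour links share a label.
    labels = [-1] * n
    next_label = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_neighbors:
            continue
        labels[i] = next_label
        stack = [i]
        while stack:
            cur = stack.pop()
            for j in neighbors[cur]:
                if labels[j] == -1 and len(neighbors[j]) >= min_neighbors:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    # Phase 2 (similarity): attach each remaining document to the most similar
    # phase-1 centroid if it clears the threshold; otherwise it keeps label -1.
    centroids = {
        c: centroid([vectors[i] for i in range(n) if labels[i] == c])
        for c in range(next_label)
    }
    for i in range(n):
        if labels[i] != -1:
            continue
        best, best_sim = -1, sim_threshold
        for c, cen in centroids.items():
            s = cosine(vectors[i], cen)
            if s >= best_sim:
                best, best_sim = c, s
        labels[i] = best
    return labels


vectors = [
    {"x": 1.0, "y": 0.5}, {"x": 0.5, "y": 1.0}, {"x": 1.0, "y": 1.0},
    {"x": 1.0, "y": 1.0, "w": 1.2},   # sparse-neighbourhood document
    {"z": 1.0},                        # outlier
    {"p": 1.0, "q": 1.0}, {"p": 1.0, "q": 1.0}, {"p": 1.0, "q": 0.9},
]
labels = two_phase_cluster(vectors)
```

The number of clusters falls out of the thresholds rather than being supplied: here two dense groups are found in phase 1, the fourth document is attached to the first group in phase 2, and the outlier stays unassigned.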

Parallel Keywords

Frequent Itemset; Pattern Matching; Feature Extraction; Document Clustering


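Cited By

陳寶燦 (2010). Applying clustering techniques to the filtering and optimization of synonymous bibliographic records [Master's thesis, Tatung University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0081-3001201315105100

林莉雯 (2011). Feature selection integrating binary particle swarm optimization and genetic algorithms for document classification [Master's thesis, Tatung University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0081-3001201315112268

林宛儒 (2012). Hybrid particle swarm optimization and genetic algorithm for dynamic document clustering [Master's thesis, Tatung University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0081-3001201315112525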
