Document Clustering Based on Frequent Itemsets Combined with Approximate Pattern Matching

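With the spread of the Internet, more and more users search online for material to read. This study aims to cluster a large document collection by topic, so that users can quickly see which topics the collection covers and pick the documents on the topics they need. The study combines the frequent itemsets of association rule mining with approximate pattern matching to mine "Approximate Frequent Patterns" as document features, and it incorporates the distance (similarity) of each approximate match into the feature weight, unlike the plain frequency count used for traditional frequent itemsets. In addition, the study proposes a "Two-Phase Density and Similarity-Based Clustering Algorithm", which does not require the number of clusters to be set in advance and is suited to clustering large document collections. Experimental results show that the number of Approximate Frequent Pattern features is 1.42 times that of flexible word pairs and 0.84 times that of single terms. Clustering with these features yields higher average recall, precision, and accuracy than clustering with flexible word pairs, adjacent word pairs (bigrams), or single terms, demonstrating that Approximate Frequent Patterns do extract more meaningful and more discriminative features. Combined with the proposed clustering algorithm, they speed up clustering, make it easier to determine an appropriate number of clusters, and improve the quality and accuracy of document clustering.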
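As a rough illustration of how a match similarity can enter a feature weight instead of a plain occurrence count, the sketch below sums the best-match similarity of a word pattern over the documents that match it closely enough. This is not the study's method: the abstract does not specify the matching measure, so the use of Python's `difflib.SequenceMatcher`, the fixed-length sliding window, and the names `best_match_similarity`, `weighted_support`, and `min_similarity` are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def best_match_similarity(pattern, tokens):
    """Highest similarity between `pattern` and any same-length window of `tokens`.

    Assumption: similarity is SequenceMatcher.ratio() over word sequences;
    the original study may use a different distance.
    """
    n = len(pattern)
    best = 0.0
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        best = max(best, SequenceMatcher(None, pattern, window).ratio())
    return best


def weighted_support(pattern, documents, min_similarity=0.6):
    """Similarity-weighted support: sum of best-match similarities over the
    documents whose best match reaches `min_similarity` (hypothetical threshold)."""
    total = 0.0
    for tokens in documents:
        sim = best_match_similarity(pattern, tokens)
        if sim >= min_similarity:
            total += sim
    return total


documents = [
    ["frequent", "itemset", "mining"],   # exact match
    ["frequent", "pattern", "mining"],   # approximate match (2 of 3 words)
    ["neural", "network", "training"],   # no match
]
pattern = ["frequent", "itemset", "mining"]
support = weighted_support(pattern, documents)
```

Under these assumptions an exact count would credit the pattern with one document, whereas the weighted version also gives partial credit to the second document.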

Keywords

Association Rule; Frequent Itemset; Pattern Matching; Feature Extraction; Document Clustering

Parallel Abstract

With the spread of the Internet, more and more users find what they want to read by searching online. This research aims to group a large number of texts by thematic document clustering, so that users can quickly see which topics the texts cover and pick the documents on the topics that interest them. To extract more meaningful features, we propose an approach that integrates frequent itemsets with approximate pattern matching to mine "Approximate Frequent Patterns". The distance (similarity) of the approximate match is used in measuring feature weights, which differs from the traditional support count (frequency) of itemsets. In addition, we present a "Two-Phase Density and Similarity-Based Clustering Algorithm". This method does not require the number of clusters to be set in advance, which makes it suitable for thematic document clustering. The experimental results show that the number of Approximate Frequent Patterns is 1.42 times that of flexible word pairs and 0.84 times that of single terms. Clustering with these features gives higher average recall, precision, and accuracy than clustering with flexible word pairs, bigrams, or single terms. This shows that Approximate Frequent Patterns extract more meaningful and more discriminative features. Moreover, the proposed clustering algorithm speeds up clustering, makes it easier to decide on an appropriate number of clusters, and improves the quality and accuracy of document clustering.
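To make concrete how a density-and-similarity scheme can avoid a preset cluster number, here is a minimal sketch. It is an assumed, simplified reading and not the algorithm from this work, whose details the abstract does not give. In phase one, documents with enough close neighbors become seeds, provided they are not already close to an earlier seed. In phase two, every document joins its most similar seed. The function names, the cosine measure, and the parameters `sim_threshold` and `min_density` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse feature vectors stored as dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def two_phase_cluster(vectors, sim_threshold=0.5, min_density=1):
    """Return one cluster label per document (-1 means unassigned)."""
    n = len(vectors)
    sims = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # Phase 1 (density): count close neighbors, then pick seeds from the
    # densest documents that are not already close to a chosen seed.
    density = [
        sum(1 for j in range(n) if j != i and sims[i][j] >= sim_threshold)
        for i in range(n)
    ]
    seeds = []
    for i in sorted(range(n), key=lambda k: -density[k]):
        if density[i] >= min_density and all(sims[i][s] < sim_threshold for s in seeds):
            seeds.append(i)

    # Phase 2 (similarity): assign each document to its most similar seed.
    labels = []
    for i in range(n):
        if not seeds:
            labels.append(-1)
            continue
        best = max(seeds, key=lambda s: sims[i][s])
        labels.append(seeds.index(best) if sims[i][best] >= sim_threshold else -1)
    return labels


vectors = [
    {"a": 1.0, "b": 1.0},
    {"a": 1.0, "b": 0.8},
    {"c": 1.0, "d": 1.0},
    {"c": 0.9, "d": 1.0},
    {"e": 1.0},
]
labels = two_phase_cluster(vectors)
```

In this sketch the number of clusters is simply the number of seeds that survive phase one, so it follows from the thresholds rather than from a value chosen beforehand.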

Parallel Keywords

Association Rule; Frequent Itemset; Pattern Matching; Feature Extraction; Document Clustering

Cited By

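陳信夫 (2011). 基於字詞關係動態建立階層分群 [Dynamically building hierarchical clusters based on word relationships] [Master's thesis, National Central University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0031-1903201314413147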
