An Efficient Grid-Based Clustering Algorithm Using a Spatial-Intersection Agglomeration Technique

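Over the past decade or so, the rapid development of information technology has steadily increased both the storage capacity of databases and the volume of data they can hold. In this era of information explosion, whoever grasps the key information gains a head start. The value of data mining lies in its ability to uncover useful information hidden in large volumes of data, and within data mining, data clustering is the most widely used technique. Data clustering analyzes the data and groups it into clusters according to the degree of similarity among data items. Many clustering algorithms have been proposed in recent years; by their underlying principle they can be classified as partitioning, density-based, hierarchical, grid-based, and hybrid techniques. Among them, GF-DBSCAN is a hybrid technique that combines density-based and grid-based concepts: it improves on FDBSCAN by using a grid to restrict the search range, which reduces both the scope and the number of searches and substantially lowers the time cost of clustering while retaining high accuracy and a high noise-filtering rate. Unfortunately, the efficiency gain it achieves is still limited. This thesis therefore proposes a new clustering algorithm, called ASSI, which improves on the merging concept of GF-DBSCAN. Clustering proceeds in a grid-based manner: the dataset is first partitioned into a grid structure, and noise cells containing few points are filtered out according to a specified grid-density threshold. Clusters are then expanded over the 3×3 neighborhood of each cell (the cell and its eight neighbors), and merging during expansion is carried out by spatial-intersection agglomeration. Experimental comparisons show that ASSI exploits the properties of grid-based clustering to reduce the number of merges and thereby improve clustering efficiency, while maintaining high clustering accuracy and a high noise-filtering rate. Compared with DBSCAN, FDBSCAN, GF-DBSCAN, CLIQUE, and ANGEL, it substantially reduces time cost, and the experimental results show that the proposed ASSI algorithm can be applied effectively to clustering large databases.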
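
The pipeline described above (partition the data into cells, drop sparse cells, expand over the eight neighboring cells) can be illustrated with a minimal Python sketch. It is not the ASSI implementation: the abstract does not specify how spatial-intersection agglomeration merges cells, so the sketch substitutes a plain breadth-first flood fill over dense cells. The function name `grid_cluster` and the parameters `cell_size` and `min_pts` are hypothetical, and the points are assumed to be two-dimensional.

```python
from collections import defaultdict, deque


def grid_cluster(points, cell_size, min_pts):
    """Label each 2-D point with a cluster id, or -1 for noise.

    Illustrative only: dense cells are joined by a simple flood fill,
    which stands in for the merging step that the abstract does not detail.
    """
    # 1. Partition: map every point index to the grid cell containing it.
    cells = defaultdict(list)
    for i, (x, y) in enumerate(points):
        cells[(int(x // cell_size), int(y // cell_size))].append(i)

    # 2. Filter: keep only cells that meet the density threshold.
    dense = {c for c, members in cells.items() if len(members) >= min_pts}

    # 3. Expand: breadth-first search over the 8 neighbouring cells.
    labels = [-1] * len(points)
    cell_label = {}
    next_label = 0
    for start in sorted(dense):
        if start in cell_label:
            continue  # already absorbed into an earlier cluster
        cell_label[start] = next_label
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for i in cells[(cx, cy)]:
                labels[i] = next_label
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = next_label
                        queue.append(nb)
        next_label += 1
    return labels


sample = [(0.1, 0.1), (0.2, 0.2), (1.1, 1.1), (1.2, 1.3),
          (5.0, 5.0), (9.1, 9.1), (9.2, 9.2)]
result = grid_cluster(sample, cell_size=1.0, min_pts=2)
print(result)
```

With these sample points, the cells (0, 0) and (1, 1) are diagonal neighbors and end up in one cluster, the lone point at (5.0, 5.0) is left as noise, and the two points near (9, 9) form a second cluster, so the printed labels are `[0, 0, 0, 0, -1, 1, 1]`. Because each point is visited once and only dense cells are traversed, the cost is driven by the number of occupied cells and not by pairwise distance computations, which is the general appeal of grid-based methods.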

Keywords

Data mining; data clustering; grid-based clustering

Parallel Abstract

Owing to advances in information technology, both database storage capacity and the amount of stored data have increased. Data mining techniques can extract implicit, useful information, and data clustering is the most commonly used method in data mining. Numerous clustering methods have been proposed, with low time cost and high correctness as the primary concerns. This thesis therefore presents a new clustering algorithm, called “ASSI”, which adopts grid-based clustering, searches the eight neighboring cells of each grid cell, and agglomerates the spaces that intersect. Simulation results indicate that the proposed ASSI algorithm clusters large databases very quickly and can filter out noise. They also reveal that its clustering quality is almost identical to, or even better than, that of several well-known existing approaches, such as the DBSCAN, FDBSCAN, GF-DBSCAN, CLIQUE, and ANGEL algorithms. Thus, the proposed ASSI performs well and is simple to implement.

Parallel Keywords

Data mining; data clustering; grid-based clustering

References

[6] Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 94–105. ACM Press, Seattle, Washington (1998)

[7] Borah, B., Bhattacharyya, D.K.: An Improved Sampling-Based DBSCAN for Large Spatial Databases. In: Proceedings of International Conference on Intelligent Sensing and Information Processing, pp. 92-96. Chennai, India (2004)

[11] Karypis, G., Han, E.H., Kumar, V.: CHAMELEON: A Hierarchical Clustering Using Dynamic Modeling. IEEE Computer: Special Issue on Data Analysis and Mining. Vol. 32, no. 8, pp. 68-75 (1999)

[12] Liu, Bing. “A Fast Density-Based Clustering Algorithm for Large Databases,” Machine Learning and Cybernetics, 2006 International Conference on, pp. 996-1000 (2006)

[13] MacQueen, J.B. “Some Methods of Classification and Analysis of Multivariate Observations,” Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297 (1967)

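Cited By

陳而設 (2016). A new high-efficiency grid clustering algorithm based on an index-value-oriented approach [Master's thesis, National Pingtung University of Science and Technology]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0042-1805201714165802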
