本研究提出以N組連結平均法的階層式自動分群演算法,其具備任意形狀的群聚探索能力,並有效避免鏈結效應的影響而提升分群結果的正確率。與相關文獻比較,對於自動分析群聚數量能更加精確。 演算法流程先以最近的共享芳鄰(Shared Nearest Neighbors)概念做為初步的雜訊過濾,再以k-means演算法將資料分割成多個子群聚,並透過N組連結平均法的距離判斷方式執行凝聚式階層分群。階層凝聚後,透過分析每次凝聚的距離,自動取得最佳的群聚數量,並將雜訊合併到最接近的群聚中。N組連結平均法改良至Single-link,將其透過兩群間最接近資料點的距離判斷方式,擴展為兩群間最小接近面的距離判斷。本研究亦混合重力理論的精神提出質量係數,將初步過濾雜訊所遺留的離群值,優先併入最近的大群聚中,避免離群值最後被視為獨立的群聚而降低分群的效果。 本研究實驗採用人工合成的二維的資料集,分別與分割式分群演算法(k-means and PAM)、階層式分群演算法(Single-link, Complete-link, Group average, and Centroid)與結合k-means 及階層式分群法之二階段分群演算法(HKC)比較,獲得對於任意形狀資料集更正確的分群結果。另以CHAMELEON的資料集比較文獻在自動分群的正確性,獲得更具精確性的群數判斷。
This study proposed a novel method of using N-link average for hierarchical automatic clustering, which has the ability to explore arbitrary shapes and can improve the accuracy of clustering to avoid chaining effect efficiently. Comparing with relevant literature, this method is more correct for the data of automatic clustering analysis. Algorithm processes firstly uses the concept of Shared Nearest Neighbors as a preliminary noise filtering, and then uses k-means algorithm to divide the data set into multiple sub-clusters; meanwhile, run agglomerative hierarchical clustering through the way which N-link average algorithm judges the gap distance. After hierarchical cohesion, the new method can obtain the best data of clusters through every analysis for the gap and merges the noise to the nearest neighbors. N-link average is reinforced from the basis of Single-link. Its way to determine has expanded from minimum distance point to surface between two clusters. This study also combined the gravity theory to present quality factors, which merged the remaining outliers into the nearest big cluster in priority after initial noise filtering; for the sake of avoiding the outliers might be considered as an independent cluster and reduce the clustering effect. The experiment uses two-dimensional synthetic data to compare separately with Partitional Clustering Algorithm(k-means and PAM), Hierarchical Clustering Algorithms(Single-link, Complete-link, Group average, and Centroid)and a Two-Phase Clustering Algorithm based on K-means and Hierarchical Clustering with Single-Linkage Agglomerative Method and the results shows the new method we proposed can generate the clustering effect more correct for the data set of arbitrary shapes. Besides, comparison with the accuracy of automatic clustering in other relevant literature, adopting the data set of CHAMELEON can obtain more precise judgment of the number of clusters.