以SimHash改良K-Means分群效率

在資料探勘與分析的過程經常需要應用分類和分群的技術。K-Means為常用的分群方法之一，但分群的過程中需要大量的計算相似度距離和中心點，造成分群效率過低。目前已有相關研究提出更好的方法來挑選初始中心點，能將中心點分散來提升準確率。另外，在文件分到適合的群組時，降低計算相似度距離的次數。但是，當資料量越大，向量維度就越高，相似度計算也就越耗時。本論文引入SimHash降維方法以改善分群效率。SimHash本為運用於文件重複偵測，將每個文件表示成固定長度之hash value，再利用漢明距離來計算文件之間的相似度。運用SimHash在分群方法，實驗結果顯示有效降低運算成本，分群結果又不會受太大影響。

關鍵字

SimHash ； K-Means ；文件分群；相似度計算

並列摘要

K-Means is one of the popular methods for clustering, but it needs a lot of processing time in similarity calculation, which caused lower performance. Some studies proposed new methods for finding better initial centroids to provide an efficient way of assigning the data points to suitable clusters with reduced time complexity. However, with the large amount of data, vector dimension will be higher and needs more time in similarity calculation. In this paper, we propose SimHash-based K-Means algorithm that used dimensionality reduction and Hamming distance to handle large amount of data. The experiment of results showed that our proposed method can improve efficiency without significantly affecting effectiveness.

並列關鍵字

SimHash ； K-Means ； Document Clustering ； Similarity Calculation

參考文獻

[1] L. Egghe, “Untangling Herdan’s law and Heaps’ law: Mathematical and informetric arguments,” JASIST, Vol. 58, No. 5, 2007, pp.702-709.

[4] J. B. MacQueen, “Some Methods for classification and Analysis of Multivariate Observations,” Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Vol. 1, 1967, pp. 281-297.

[5] S. P. Lloyd, “Least square quantization in PCM,” IEEE Trans. Inform. Theory, Vol. 28, No. 2, 1982, pp. 129-137.

[6] M. Figueiredo and A. Jain, “Unsupervised learning of finite mixture models,” IEEE Trans. Pattern Anal. Machine Intell, Vol. 24, No. 3, 2002, pp. 381-396.

[7] G. Usman, U. Ahmad and M. Ahmad, “Improved K-Means Clustering Algorithm by Getting Initial Centroids,” World Applied Sciences Journal, Vol. 27, No. 4, 2013, pp. 543-551.

國際替代計量

以SimHash改良K-Means分群效率

全文下載

主題瀏覽