Performance Analysis of Clustering Algorithms for Categorical Variables: The Case of Single Nucleotide Polymorphisms

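In an era of abundant, fast-moving information, handling massive data sets is an important problem, and cluster analysis is one way to address it. Cluster analysis is a long-established and widely used method, but most applications deal with continuous data; only a few deal with discrete data, and almost none with discrete data in large samples. This study compares the clustering performance of three methods for discrete data: K-modes, Hamming distance-based clustering algorithms (HD cluster), and RObust Clustering using linKs (ROCK). The three methods are evaluated on samples whose variables have different frequencies or different correlation coefficients. The clustering results are judged by five criteria: Rand Index (RI), Adjusted Rand Index (ARI), Number in Wrong Clusters, C-impurity, and Normalized Mutual Information (NMI). Simulated data are generated with "ep" in R and adjusted to be imbalanced and to have different correlation coefficients, in order to show the strengths and weaknesses of the three methods under different conditions. The results show that HD cluster outperforms the other two methods in every setting considered. Limitations of HD cluster and directions for future work are also discussed.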

Keywords

discrete data; large sample; K-modes; Hamming distance-based clustering algorithms (HD cluster); RObust Clustering using linKs (ROCK)

Parallel Abstract

Digital data and information are being generated at an ever-increasing rate. Dealing with such large amounts of information has become an important issue for scientists. One way to reduce such large volumes of data is clustering. Clustering algorithms have a long history of development. Most of them, however, target continuous observations, such as age and weight. For categorical data, few algorithms have been proposed, and fewer still for large data sets. In this paper, we evaluate the performance of clustering algorithms for categorical variables. Specifically, we compare three algorithms: K-modes, Hamming distance-based clustering algorithms (HD cluster), and RObust Clustering using linKs (ROCK). We investigate how their performance is affected by the frequencies of the variables and by the correlation between variables. The evaluation criteria are Rand Index (RI), Adjusted Rand Index (ARI), Number in Wrong Clusters, C-impurity, and Normalized Mutual Information (NMI). Simulation studies are conducted to compare all three algorithms. The results show that HD cluster performs better than, or at least as well as, the other two algorithms in all tested cases. Finally, we discuss limitations and future directions for the HD cluster algorithm.

Parallel Keywords

large volume; categorical data; K-modes; Hamming distance-based clustering algorithms (HD cluster); RObust Clustering using linKs (ROCK)
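
As an illustration of two of the evaluation criteria named in the abstract, the following is a minimal Python sketch of the Rand Index and the Adjusted Rand Index computed by pair counting, together with a plain Hamming distance between two categorical vectors. It is not the study's own code (the abstract states that the simulations were run in R). The function names, the toy labels, and the integer coding of the categories are assumptions made for this example only.

```python
from collections import Counter
from math import comb


def hamming_distance(a, b):
    """Count the positions at which two equal-length categorical vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def pair_counts(truth, pred):
    """Return pair counts shared by the Rand Index and the Adjusted Rand Index."""
    n = len(truth)
    if n != len(pred):
        raise ValueError("label vectors must have the same length")
    if n < 2:
        raise ValueError("at least two observations are required")
    # Pairs placed in the same cluster by both partitions.
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs placed in the same cluster by each partition separately.
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    return same_both, same_truth, same_pred, comb(n, 2)


def rand_index(truth, pred):
    """Fraction of observation pairs on which the two partitions agree."""
    same_both, same_truth, same_pred, total = pair_counts(truth, pred)
    # Pairs separated by both partitions (inclusion-exclusion).
    apart_both = total - same_truth - same_pred + same_both
    return (same_both + apart_both) / total


def adjusted_rand_index(truth, pred):
    """Rand Index corrected for chance agreement (Hubert and Arabie, 1985)."""
    same_both, same_truth, same_pred, total = pair_counts(truth, pred)
    expected = same_truth * same_pred / total
    max_index = (same_truth + same_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (same_both - expected) / (max_index - expected)


if __name__ == "__main__":
    true_labels = [0, 0, 1, 1]       # hypothetical true clusters
    found_labels = [1, 1, 0, 0]      # same partition, different label names
    print(rand_index(true_labels, found_labels))
    print(adjusted_rand_index(true_labels, found_labels))
    print(hamming_distance([0, 1, 2, 1], [0, 2, 2, 0]))
```

Both indices depend only on which observations are grouped together, so relabeling the clusters does not change the score. The adjusted version subtracts the agreement expected by chance, which matters when cluster sizes are imbalanced.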

