A Study of k-Means Clustering of Functional Data Using Functional Principal Component Scores and Mean Curves

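Cluster analysis aims to partition data into groups that differ substantially from one another while keeping the members of each group highly similar. It is one of the important data mining tools for analyzing high-dimensional data and large databases; once the data have been clustered, the relationship between group members and the variables of interest is easier to explore. Before cluster analysis is applied to high-dimensional data, the dimension of the data is usually reduced first, and reducing the dimension from different viewpoints may lead to different conclusions. This thesis studies the problem of clustering the observed subjects of functional data. In the literature (Abraham, 2003), the data are reduced from the viewpoint of the mean function, and the reduced data are then clustered by the k-means method. Peng and Muller (2008) assume that all curves share the same mean function and use the distribution of finite-dimensional functional principal component scores to explore the clustering of the data. However, whether the dimension reduction starts from the mean function or from the covariance function, the resulting clusters reflect the characteristics of the mean function. This phenomenon motivated us to attempt a theoretical analysis of the utility of these two approaches. In this thesis we show that, in certain situations, starting from the covariance function reduces clustering quality. Chiou and Li (2007) proposed a clustering algorithm based on iterative reclustering; in its initial clustering step, the distribution of finite-dimensional functional principal component scores is used to explore a preliminary clustering of the data with respect to the mean structure. Based on our reasoning, we recommend that the initial clustering explore the grouping of the data from the viewpoint of the mean function.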
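
One way to see why clusters obtained from covariance-based features can still reflect the mean functions is the decomposition below. It is only a sketch under assumptions that the abstract does not state: two clusters with proportions \pi_1 and \pi_2, cluster mean functions \mu_1 and \mu_2, and a common within-cluster covariance function G_W. All of these symbols are introduced here for illustration.

```latex
\mu(t) = \pi_1 \mu_1(t) + \pi_2 \mu_2(t), \qquad
G(s,t) = G_W(s,t) + \pi_1 \pi_2 \,\{\mu_1(s) - \mu_2(s)\}\,\{\mu_1(t) - \mu_2(t)\}.
```

Under these assumptions the pooled covariance G, computed around the common mean \mu, contains a rank-one term built from the difference of the cluster means, so functional principal component scores derived from G can pick up mean differences. Whether the leading components retain that term depends on how large it is relative to G_W.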

Keywords

functional principal component analysis; k-means clustering

Parallel abstract

Organizing functional data into sensible groupings is one of the most fundamental ways of understanding and learning the underlying mechanism that generates the data. Cluster analysis is often employed to search for homogeneous subgroups of individuals in a data set. Abraham et al. (2003, Scandinavian Journal of Statistics) start with feature extraction on the mean function and use the k-means clustering procedure to determine the clusters. Peng and Muller (2008, Annals of Applied Statistics) assume a common mean function for all units and start with feature extraction on the covariance function. However, the clusters found by the k-means clustering procedure can be explained through the characteristics of the mean function of each unit. This motivates a theoretical study comparing the utility of these two approaches in the setting of densely observed functional data. We present only the case of two clusters. We analyze the loss of efficiency incurred by feature extraction on the covariance function. Chiou and Li (2007, Journal of the Royal Statistical Society, Series B) proposed an iterative functional clustering algorithm that applies the method used in Peng and Muller to the initial clustering stage. We advocate using the mean function in the initial stage, and we provide an analysis to support this recommendation.
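
The two routes compared in the abstract can be sketched on toy data as follows. This is a minimal illustration, not the procedure of any cited paper: the simulated curves, the grid, the helper names (`simulate`, `kmeans2`, `fpc_scores`, `agreement`), and all parameter values are assumptions made here. Raw grid values stand in for feature extraction on the mean function, and the eigenvectors of the pooled grid covariance stand in for functional principal components.

```python
import math
import random

def simulate(n=60, T=20, seed=0):
    """Toy data: two clusters of densely observed curves on a common grid."""
    rng = random.Random(seed)
    grid = [(j + 0.5) / T for j in range(T)]
    curves, labels = [], []
    for i in range(n):
        c = i % 2                                   # true cluster label
        xi = rng.gauss(0.0, 0.3)                    # within-cluster random score
        curves.append([c * 3.0 * math.sin(2 * math.pi * t)   # cluster mean
                       + xi * math.cos(2 * math.pi * t)      # within-cluster variation
                       + rng.gauss(0.0, 0.1)                 # measurement noise
                       for t in grid])
        labels.append(c)
    return curves, labels

def kmeans2(points, iters=50):
    """Lloyd's algorithm with k = 2 and a deterministic farthest-point start."""
    d2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    c0 = points[0]
    c1 = max(points, key=lambda p: d2(p, c0))
    centers = [list(c0), list(c1)]
    labels = None
    for _ in range(iters):
        new = [0 if d2(p, centers[0]) <= d2(p, centers[1]) else 1 for p in points]
        if new == labels:                           # assignments stopped changing
            break
        labels = new
        for k in (0, 1):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def fpc_scores(curves, M=2, power_iters=100, seed=1):
    """Projections onto the leading M eigenvectors of the pooled grid covariance,
    after subtracting one common mean curve (power iteration with deflation).
    They match FPC scores only up to a grid-spacing scale factor."""
    n, T = len(curves), len(curves[0])
    mean = [sum(col) / n for col in zip(*curves)]
    X = [[x - m for x, m in zip(row, mean)] for row in curves]
    rng = random.Random(seed)
    comps = []
    for _ in range(M):
        v = [rng.gauss(0.0, 1.0) for _ in range(T)]
        for _ in range(power_iters):
            for u in comps:                         # remove directions already found
                proj = sum(a * b for a, b in zip(v, u))
                v = [a - proj * b for a, b in zip(v, u)]
            s = [sum(a * b for a, b in zip(row, v)) for row in X]       # X v
            v = [sum(s[i] * X[i][j] for i in range(n)) / n for j in range(T)]
            norm = math.sqrt(sum(a * a for a in v))
            v = [a / norm for a in v]
        comps.append(v)
    return [[sum(a * b for a, b in zip(row, u)) for u in comps] for row in X]

def agreement(labels, truth):
    """Share of curves labelled consistently with truth, up to swapping the labels."""
    hits = sum(l == t for l, t in zip(labels, truth))
    return max(hits, len(truth) - hits) / len(truth)

curves, truth = simulate()
acc_curve = agreement(kmeans2(curves), truth)                   # k-means on the curves
acc_fpc = agreement(kmeans2(fpc_scores(curves, M=2)), truth)    # k-means on the scores
print(acc_curve, acc_fpc)
```

The toy clusters are deliberately well separated, so the sketch only shows how the two pipelines are assembled. It does not reproduce the situations in which the thesis argues that starting from the covariance function loses efficiency.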

Parallel keywords

functional principal component; k-means

References

[2] Ball, G.H. and Hall, D.J. (1967). A clustering technique for summarizing multivariate data. Behavioral Science, 12, 153-155.

[3] Bunea, F., Ivanescu, A.E. and Wegkamp, M. (2011). Adaptive inference for the mean of a Gaussian process in functional data. Journal of the Royal Statistical Society, Ser. B, 73, part 3.

[4] Chiou, J.-M. and Li, P.-L. (2007). Functional clustering and identifying substructures of longitudinal data. Journal of the Royal Statistical Society, Ser. B, 69, 679-699.

via subspace projection. Journal of the American Statistical Association, 2006, 223-235.

