
考慮相關性之群集分析及其在基因分群上的應用

Clustering Analysis by Attributes Interrelations and its Application to Clustering of Differentially Expressed Genes

Advisor: 陳正剛

Abstract


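Clustering analysis and factor analysis are both statistical methods for exploring the correlation structure among attributes. Such structure is usually expressed by dividing the attributes into meaningful groups according to their similarity or their interrelationships. However, when the number of samples is far smaller than the number of attributes, the computation becomes rank-deficient (insufficiency of full rank) and factor analysis cannot be used. Microarray data analysis is one example: each microarray contains expression levels for tens of thousands of genes, yet the number of genes usually far exceeds the number of microarrays. Clustering analysis, on the other hand, can handle data with many attributes but few samples. It has drawbacks of its own, however, including misuse of the Pearson correlation coefficient as a measure of dissimilarity between attributes, the difficulty of judging the quality of a clustering result, and the choice of the number of clusters. The first aim of this research is to examine the interrelationship structure among attributes and to develop new clustering methods that group interrelated attributes. Whereas the "R2 with PCA" method places more weight on the linear relationships between clusters, the "Variance explanation" method emphasizes both the interrelationships among attributes and the variation of the attributes. The second aim is to propose several indices for judging the quality of a clustering result; these indices account for the interrelationships among attributes and for the amount of variance explained by different clustering results. Finally, the new methods are applied to two cases: an analysis of nineteen human blood test indicators, and an analysis of Down syndrome microarray data.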
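The rank problem described above can be checked numerically. The following is a minimal sketch and is not taken from the thesis: the sample and attribute counts and the random data are assumptions chosen only for illustration. It shows that the sample correlation matrix of the attributes cannot reach full rank when there are fewer samples than attributes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: far fewer samples (rows) than attributes (columns).
n_samples, n_attributes = 10, 50
X = rng.normal(size=(n_samples, n_attributes))

# Sample correlation matrix of the attributes, shape (50, 50).
R = np.corrcoef(X, rowvar=False)

# Centering removes one degree of freedom, so the rank is at most n_samples - 1.
rank = np.linalg.matrix_rank(R)
print(R.shape, rank)
```

Because the matrix is singular, it cannot be inverted, which is the kind of numerical obstacle the abstract attributes to factor analysis in this setting.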

Parallel Abstract


Clustering analysis and Factor analysis are unsupervised classification methods that aim to find meaningful structures in the observed attributes. These structures are usually expressed by grouping attributes according to their similarities or interrelationships. The disadvantage of Factor analysis is that it fails numerically when the data lack full rank, which happens when there are far fewer samples than attributes. In microarray data analysis, for example, expressions of 10,000 to 20,000 genes are collected for each array, and the number of genes is usually far larger than the number of microarrays. Clustering analysis, on the other hand, can handle a large number of attributes with few samples. It has drawbacks of its own, including misapplication of the correlation coefficient, the difficulty of evaluating cluster quality, and the determination of the number of clusters. In this research, we first discuss how to characterize interrelationships among attributes, and then develop clustering methods suited to grouping interrelated attributes. The "R2 with PCA" method places more weight on the linear relationships between two clusters, while the "Variance explanation" method focuses not only on interrelations among attributes but also on the variation of the attributes. We also propose statistics for evaluating cluster quality; these statistics take into account the interrelationships among clusters and the variance explained by the clusters. Finally, we apply the new methods to two cases: one is a set of 19 blood tests from 24 human subjects, and the other is Down syndrome microarray data.
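The abstract does not say exactly how the correlation coefficient is misapplied. One commonly discussed pitfall, assumed here purely for illustration, is using the signed quantity 1 - r as a dissimilarity, which places two strongly but negatively related attributes as far apart as possible. The sketch below uses made-up data and contrasts it with 1 - r², which treats any strong linear relation as similarity. Neither measure is claimed to be the one used in the thesis.

```python
import numpy as np

# Made-up data: y is perfectly, but negatively, linearly related to x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = -2.0 * x + 1.0

# Pearson correlation coefficient between the two attributes.
r = np.corrcoef(x, y)[0, 1]

# Signed version: negative correlation counts as maximal dissimilarity.
d_signed = 1.0 - r

# Squared version: any strong linear relation counts as similarity.
d_squared = 1.0 - r ** 2

print(d_signed, d_squared)
```

Here `d_signed` comes out close to 2 while `d_squared` comes out close to 0, so the two choices would send this pair of attributes to opposite ends of a clustering.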

