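Background With advances in biotechnology, researchers can now obtain large numbers of genetic markers for studies of complex genetic diseases; the single nucleotide polymorphism (SNP) is one commonly used genetic marker. A single study can therefore collect tens of thousands of markers, but such high dimensionality usually makes analysis difficult, so effectively reducing the number of parameters in the analysis becomes an important issue. Moreover, if multiple similar SNPs can be combined into one group, their joint effect on disease can be used to aid subsequent analysis. Even when SNPs have already been classified by biological function or experimental method, the resulting classes often remain high-dimensional, so further clustering by other methods is still needed. This study proposes a clustering method suitable for SNPs or sets of SNPs. Materials and methods Hamming distance is a simple measure commonly used to quantify the similarity of string data. This study uses Hamming distance as the similarity measure between SNP genotypes and proposes three ways of measuring the similarity between SNP clusters. The three methods use different measures to capture the distance between clusters. To assess the performance of the proposed methods, we run simulations using SNP data collected in the coronary artery disease (CAD) study of the UK Wellcome Trust Case Control Consortium (WTCCC), and use accuracy, sensitivity, specificity, the adjusted Rand index, and normalized mutual information as criteria for comparing clustering results. Results This study shows how to use Hamming distance to cluster SNP data; in our simulations, it gives results as good as or better than methods in the literature. Discussion This study proposes using Hamming distance to measure the degree of association between SNPs. This measure is conceptually similar to linkage disequilibrium, but it focuses more on similarity between individuals and does not require the physical distance between markers to be considered. In addition, the coding of SNP genotypes can be changed to suit different modes of inheritance. As for cluster analysis, some researchers are concerned with how to choose the number of clusters; we suggest that, when no other information is available, the maximum difference in cluster association can be used as the basis for choosing the number of clusters.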
Background With recent advances in laboratory technology, scientists can genotype thousands or millions of markers for genetic association studies of complex diseases. This large number of markers makes analysis difficult, so effectively reducing the dimension for further analysis becomes an important issue. Dimension reduction has another advantage: after single nucleotide polymorphisms (SNPs) sharing similar features are clustered into one group, the small effect of each single SNP will not be overlooked. Such clustering may help laboratory scientists identify novel associations between markers and disease, and may aid biological interpretation. The aim of this study is to provide a suitable clustering method for SNP observations. Materials and methods Hamming distance is a simple and popular dissimilarity measure for string data. Based on Hamming distance, we propose three dissimilarity measures to represent the distance between two SNP clusters. We then use these measures in a clustering algorithm, specifically hierarchical clustering, chosen because it better explains subgroup structure, to build a tree structure called a dendrogram for the data under study. To evaluate the performance of our approaches, we simulate SNP genotypes based on the coronary artery disease (CAD) study from the Wellcome Trust Case Control Consortium (WTCCC). We use accuracy, sensitivity, specificity, the adjusted Rand index, and normalized mutual information as criteria for comparison with existing methods. Results We propose a hierarchical clustering method for SNP sets based on Hamming distance. The simulation studies show that our approaches perform as well as or better than those proposed in the literature. When the number of clusters is unknown and needs to be determined, we recommend the maximum difference in adjacent dissimilarity measures as a threshold.
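The pipeline described above (Hamming distance between SNPs, a cluster-level dissimilarity, agglomerative merging, and a cut at the largest jump between adjacent merge heights) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the three proposed cluster dissimilarities, so the standard single, complete, and average linkages stand in for them; each SNP is assumed to be a vector of genotypes coded 0/1/2 across the same individuals; and all function names and the toy data are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Count the positions (individuals) at which two genotype vectors differ."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


def cluster_distance(c1, c2, snps, linkage="average"):
    """Dissimilarity between two clusters of SNP indices.

    Assumption: standard linkages are used as stand-ins for the three
    measures proposed in the study, which the abstract does not define.
    """
    d = [hamming(snps[i], snps[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(d)
    if linkage == "complete":
        return max(d)
    return sum(d) / len(d)  # average linkage


def agglomerate(snps, linkage="average"):
    """Naive agglomerative clustering.

    Returns a list of (merge_height, clusters_after_merge) pairs,
    one per merge, which is the information a dendrogram encodes.
    """
    clusters = [(i,) for i in range(len(snps))]
    history = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda p: cluster_distance(p[0], p[1], snps, linkage),
        )
        height = cluster_distance(a, b, snps, linkage)
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        history.append((height, list(clusters)))
    return history


def cut_at_largest_gap(history):
    """Keep the clustering that precedes the largest jump in merge height."""
    heights = [h for h, _ in history]
    gaps = [heights[k + 1] - heights[k] for k in range(len(heights) - 1)]
    if not gaps:  # fewer than two merges: nothing to compare
        return history[-1][1]
    k = max(range(len(gaps)), key=gaps.__getitem__)
    return history[k][1]


# Toy data (hypothetical): 5 SNPs genotyped in 6 individuals.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 1],
    [0, 1, 2, 0, 2, 2],
    [2, 0, 0, 1, 0, 0],
    [2, 0, 0, 1, 0, 1],
]
groups = cut_at_largest_gap(agglomerate(snps))
print(sorted(groups))
```

On this toy input the first three SNPs and the last two form two groups, because the final merge is far higher than the earlier ones. The quadratic search for the closest pair is kept for readability; a real analysis with many SNPs would need a more efficient implementation.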
Discussion Our proposal uses Hamming distance to measure the similarity between SNP strings. This is similar to linkage disequilibrium (LD) but focuses more on inter-personal similarity. The approaches can be extended to other modes of inheritance by changing the coding of SNP genotypes.
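To illustrate how changing the genotype coding changes the distance, the sketch below recodes minor-allele counts (0, 1, 2) under additive, dominant, and recessive models before computing the Hamming distance. The coding tables are the conventional ones and are an assumption here, since the abstract does not specify which codings the study uses; the names `CODINGS` and `recode` are hypothetical.

```python
# Conventional codings of minor-allele counts (assumed, not taken from the study).
CODINGS = {
    "additive": {0: 0, 1: 1, 2: 2},   # each genotype is distinct
    "dominant": {0: 0, 1: 1, 2: 1},   # one or two copies are treated alike
    "recessive": {0: 0, 1: 0, 2: 1},  # only two copies count
}


def recode(genotypes, mode):
    """Map minor-allele counts to the coding for the given mode of inheritance."""
    table = CODINGS[mode]
    return [table[g] for g in genotypes]


def hamming(a, b):
    """Count the positions at which two equal-length genotype vectors differ."""
    if len(a) != len(b):
        raise ValueError("genotype vectors must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Two hypothetical SNPs genotyped in the same four individuals.
snp_a = [0, 1, 2, 1]
snp_b = [0, 2, 2, 0]

for mode in CODINGS:
    d = hamming(recode(snp_a, mode), recode(snp_b, mode))
    print(mode, d)
```

For these two vectors the distance is 2 under the additive coding and 1 under both the dominant and recessive codings, which shows that the choice of coding can change which SNPs appear similar.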