
Development of analytic strategies to improve association testing with copy number variation (CNV) data

Advisors: 洪弘, 郭柏秀

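Abstract (translated from the Chinese)


Copy number variation (CNV) is a structural variation of DNA, and many recent studies have reported its association with a range of complex diseases. Array-based technology makes it possible to scan large numbers of CNV signals quickly, and many newly developed statistical methods attempt to estimate copy numbers from the experimentally measured signal values. The main difficulty for these methods is that discrete copy numbers must be estimated from continuous signals read from a series of markers; beyond that, we also wish to perform association testing to identify relationships between CNVs and disease.

In the first stage of CNV analysis, genome-wide signals are typically used to search for CNV segments associated with disease. Because CNVs are rare DNA variants with relatively small effects, comparing patients with non-patients is difficult. To reduce cost, many recent studies have adopted genome-wide scans of pooled samples; however, because of the complexity of CNVs, applying this strategy to CNV detection poses even greater challenges. In this study, we aim to develop a procedure that uses pooled samples to identify associations between CNVs and disease. We established a series of filtering steps to remove likely false-positive results and applied the procedure to CNV data on bipolar disorder. We first defined the CNV regions of each pool, then selected the CNV regions whose distribution differed between the case and control groups, and finally explored the association between CNVs and bipolar disorder by integrating the functions of the genes mapped to these regions and comparing them with previously published studies.

In the second stage of CNV analysis, copy numbers can be estimated by cluster analysis of validation signals obtained from specific segments. However, CNV data show weak cluster separation and contain outliers, which makes it difficult to recover the correct grouping. γ-SUP is a newly developed method that addresses these problems and does not require the number of clusters to be fixed in advance. γ-SUP does require a parameter τ that influences the number of clusters, but the subjective selection rule suggested by its authors has no established relationship to the quality of the resulting analysis. In this study, we aim to develop a method for selecting the γ-SUP parameter based on the concept of stability. A stable cluster analysis is one whose clustering result can be reproduced many times, so we use resampling to compute a measure of stability. Simulation analyses show that the proposed method finds an appropriate parameter, and we then applied it to CNV data on autism.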

Parallel Abstract


Copy number variation (CNV) is a type of structural variation in DNA that has been reported to be associated with a number of complex diseases. Array-based technology enables fast scanning of large numbers of CNVs, and many statistical strategies have been developed to estimate copy number from experimental data. The challenge lies in estimating discrete copy numbers from continuous signals called from a set of markers. A further complication lies in simultaneously performing association testing between CNVs and diseases.

At the first stage of CNV analysis, genome-wide data are searched for CNV regions related to the trait of interest. Because CNVs are rare and have small effect sizes, it is generally difficult to compare their frequency between cases and controls using traditional statistical methods. Recently, DNA pooling strategies have been adopted to save genotyping cost. However, CNV detection is even more challenging with pooled data. The first aim of this study is to develop a series of procedures to detect associations between CNVs and a trait of interest using a pooling strategy. We set a series of criteria to filter out noise in the data and reduce false-positive findings. We applied our procedures to an empirical CNV dataset of bipolar disorder. First, we defined CNV regions for every pool. Second, we selected CNV regions with different patterns between case and control pools. Finally, we integrated our findings with the functions of the mapped genes and the results of previous studies to explore the associations between CNVs and bipolar disorder.

At the second stage of CNV analysis, a clustering procedure is applied to estimate copy numbers from the validation signals of a specified CNV region. When clusters in CNV data are poorly separated and outliers are present, it is more challenging to identify the correct clusters. The γ-self-updating process (γ-SUP) is a newly developed method that can overcome these problems, and it does not require the number of clusters to be specified in advance. The performance of γ-SUP relies on the selection of a tuning parameter τ, which influences the number of clusters. However, the relationship between the subjective selection rule suggested by the method's authors and the quality of the final clustering output is unclear. The second aim of this study is to develop a procedure for selecting τ in γ-SUP based on the idea of stability. In our method, stability is defined as the reproducibility of clustering results, and a measure of instability is constructed using a resampling scheme. Simulation studies show that the proposed selection criterion provides an adequate value of τ. We also applied this method to an empirical CNV dataset of autism.
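The stability idea can be illustrated with a minimal sketch. It is not the thesis's procedure: the abstract does not give the exact instability measure, and γ-SUP itself is not implemented here. The names `gap_cluster`, `disagreement`, `instability`, and `select_tau` are hypothetical, and `gap_cluster` is a simple one-dimensional stand-in whose parameter `tau` merely plays the role of a tuning parameter that controls the number of clusters. The sketch draws pairs of subsamples, clusters each one, and measures how often two shared points are grouped together in one clustering but not in the other.

```python
import random
from itertools import combinations


def gap_cluster(values, tau):
    """Stand-in clusterer (NOT gamma-SUP): sort 1-D values and start a
    new cluster whenever the gap to the previous value exceeds tau."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] > tau:
            current += 1
        labels[cur] = current
    return labels


def disagreement(labels_a, labels_b, common):
    """Fraction of point pairs whose co-membership differs between two
    clusterings, computed over the points present in both subsamples."""
    pairs = list(combinations(common, 2))
    if not pairs:
        return 0.0
    differ = sum(
        (labels_a[i] == labels_a[j]) != (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return differ / len(pairs)


def instability(values, tau, n_rounds=20, frac=0.8, seed=0):
    """Average disagreement over n_rounds pairs of random subsamples."""
    rng = random.Random(seed)
    n = len(values)
    m = max(2, int(frac * n))
    total = 0.0
    for _ in range(n_rounds):
        idx_a = rng.sample(range(n), m)
        idx_b = rng.sample(range(n), m)
        lab_a = dict(zip(idx_a, gap_cluster([values[i] for i in idx_a], tau)))
        lab_b = dict(zip(idx_b, gap_cluster([values[i] for i in idx_b], tau)))
        common = sorted(set(idx_a) & set(idx_b))
        total += disagreement(lab_a, lab_b, common)
    return total / n_rounds


def select_tau(values, candidates, **kwargs):
    """Return the candidate with the lowest instability, plus all scores."""
    scores = {tau: instability(values, tau, **kwargs) for tau in candidates}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    signals = [1.0, 1.1, 0.9, 1.05, 2.0, 2.1, 1.95, 2.05, 3.0, 3.1, 2.9, 3.05]
    best, scores = select_tau(signals, [0.07, 0.5])
    print(best, scores)
```

Trivial settings (a single cluster, or every point in its own cluster) are perfectly stable under this measure, so a candidate grid for `tau` would need to exclude them or be combined with some other criterion.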

