
利用穩健重覆排序方法偵測表現差異及其應用於分析混合樣本之全基因體掃描資料

A Robust Re-Rank Approach with Application to Pooling-Based GWA Study Data

Advisor: 洪弘

Abstract


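In recent years, rapid advances in research technology have made it increasingly easy for researchers to obtain databases containing tens of thousands of variables at once, so that the sample size is very small by comparison. When the number of variables far exceeds the sample size, the t-statistic traditionally used to detect differences between two groups is poorly suited because its variance estimates are not stable enough. The area under the ROC curve (AUC), which is also used to detect differences between two groups, is a nonparametric method that depends less on distributional assumptions, yet tied values occur so frequently that ranking and selecting variables becomes difficult. To balance power and robustness, it is more suitable to change the traditional way of assigning ranks and redefine them as ranks across the different variables within the same sample. In this study, we propose a re-rank method based on the "rank-over-variable" concept and combined with two techniques, "random subset" and "re-rank", which helps researchers effectively select variables that differ between two groups when the number of variables far exceeds the sample size. To evaluate the method, we conduct simulation analyses based on the GAIN-MDD dataset and verify that, compared with the t-statistic and AUC, the proposed re-rank method detects variables that truly differ between the two groups more effectively and is less affected by small sample sizes and experimental error. Finally, we apply the new method to a pooling-based genome-wide scan study and detect genes that may be associated with bipolar disorder, providing candidates for further investigation.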
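To make the tie problem concrete, here is a minimal sketch. It is not taken from the thesis: the data are synthetic Gaussian noise, and the sizes (`n_features`, `n_per_group`) are hypothetical. With 3 versus 3 observations the AUC can take at most 10 distinct values, so thousands of features necessarily share the same score. This is one reading of why ranking features by AUC becomes difficult when samples are few.

```python
import numpy as np

def auc(x, y):
    """Mann-Whitney estimate of P(X > Y) + 0.5 * P(X == Y)."""
    diff = x[:, None] - y[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(x) * len(y))

# Hypothetical sizes, chosen only for illustration.
n_features, n_per_group = 5000, 3
rng = np.random.default_rng(0)
cases = rng.normal(size=(n_features, n_per_group))
controls = rng.normal(size=(n_features, n_per_group))

# One AUC per feature (row).
aucs = np.array([auc(c, d) for c, d in zip(cases, controls)])

# Count how many different AUC values the 5000 features actually produce.
distinct = np.unique(aucs)
print(f"{n_features} features share {len(distinct)} distinct AUC values")
```

Because so many features land on the same AUC value, the order among them is arbitrary, which motivates a statistic with finer resolution.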

Parallel Abstract


Recently, more and more studies encounter data in which each object has an extremely large number of variables while the available sample size is relatively small. For detecting the difference between two populations in this situation, the widely used two-sample t-test is ill-suited because its variance estimates are unstable. Its non-parametric counterpart, the AUC, suffers from tied values and also fails. To improve detection power while retaining robustness, the idea of "rank-over-variable" is more appropriate for analyzing large-p-small-n datasets. In this study, we propose a robust re-rank approach to overcome these difficulties and reduce the influence of the enormous number of features in the large-p-small-n situation. In particular, we obtain a rank-based statistic for each feature based on the concept of "rank-over-variable". The "random subset" and "re-rank" techniques are then applied iteratively to rank the features. Finally, the leading features in the resulting ranking list are selected for further research. To evaluate the performance of the proposed re-rank approach, we conduct several simulation studies based on the GAIN-MDD dataset. Compared with the t-statistic and AUC, our re-rank approach identifies more of the pre-defined truly relevant SNPs and is robust to different numbers of pools and levels of pooling error. Furthermore, we present a real data analysis that explores markers associated with bipolar disorder.
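The abstract does not specify the statistic, the subset size, or the iteration scheme, so the following Python sketch is only one possible reading of how "rank-over-variable", "random subset", and "re-rank" might fit together, not the thesis's actual algorithm. All names (`rank_within_samples`, `rerank_scores`), the difference-in-mean-rank statistic, the averaging across subsets, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def rank_within_samples(X):
    """Rank-over-variable: rank the features within each sample (row).

    Double argsort yields 0-based ranks; +1 makes them 1-based.
    """
    return np.argsort(np.argsort(X, axis=1), axis=1) + 1

def rerank_scores(X, y, subset_size, n_iter, seed=0):
    """Assumed scheme: draw random feature subsets, re-rank within each
    subset, and average an absolute difference in mean rank per feature."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    total = np.zeros(n_features)
    counts = np.zeros(n_features)
    for _ in range(n_iter):
        # Random subset of features, drawn without replacement.
        idx = rng.choice(n_features, size=subset_size, replace=False)
        # Re-rank: ranks are recomputed using only the chosen features.
        R = rank_within_samples(X[:, idx])
        stat = np.abs(R[y == 1].mean(axis=0) - R[y == 0].mean(axis=0))
        total[idx] += stat
        counts[idx] += 1
    # Average over the iterations in which each feature was drawn.
    return total / np.maximum(counts, 1)

# Synthetic example (hypothetical sizes): 200 features, 20 samples per
# group, with features 0..4 shifted upward in the case group.
rng = np.random.default_rng(1)
n_per_group, n_features, n_true = 20, 200, 5
y = np.repeat([0, 1], n_per_group)
X = rng.normal(size=(2 * n_per_group, n_features))
X[y == 1, :n_true] += 3.0

scores = rerank_scores(X, y, subset_size=50, n_iter=300)
top = np.argsort(scores)[::-1][:10]  # leading features in the ranking list
print("top-ranked feature indices:", top)
```

In this sketch each sample contributes only its within-sample ordering of features, so a constant shift applied to every feature in one sample leaves its ranks unchanged. That may be part of why such a statistic could tolerate some forms of experimental error, though the thesis should be consulted for the actual argument.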

