A Three-Stage Strategy for Interaction Selection and Model Building in High-Dimensional Data

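Over the past decade or so, rapid advances in biomedical technology for genetic research have produced large amounts of data in which the number of variables far exceeds the number of observations, that is, each observed sample carries many explanatory variables. We call such data high-dimensional data; real examples include gene expression data and single nucleotide polymorphisms. With high-dimensional gene expression data and disease occurrence as the response variable, standard statistical methods cannot be used to determine which gene may have a statistically significant association with the disease, or near which gene a disease-causing locus may lie. The high-dimensional data must first undergo statistical processing such as dimension reduction before any subsequent statistical analysis can be carried out. In recent years, the effect of interactions between genes on the chromosome on phenotype or disease occurrence has been a popular topic in statistical genetics. Many genetic studies show that most complex diseases are not caused by a single gene but are jointly influenced by interactions among more than one gene. To address this problem, this study proposes a strategy for interaction-effect selection and model building. After a genome-wide association scan identifies genes that may be related to the disease, PCA is used to reduce the high-dimensional data to low-dimensional data: genes with similar properties are grouped into a principal component, yielding a number of independent principal components. Because the model must also include pairwise interaction effects, a large number of parameters would need to be estimated. Our aim is to reduce the number of interactions effectively. We first use a one-parameter score test to test every interaction effect one at a time. We then take the PCA principal components as main effects and add the interaction effects that the one-parameter score test finds statistically significant. Finally, we use LASSO for variable selection and model building. Simulation studies show that this strategy reduces the computation time for fitting the LASSO model and raises the accuracy of variable selection.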

Keywords

High-dimensional data; Interaction; Least absolute shrinkage and selection operator; Principal component analysis; Score test

Parallel Abstract

Owing to breakthroughs in biomedical technology, many studies have produced data in which the number of variables far exceeds the number of observations. Such data are called high-dimensional data; examples include gene expression profiles and single nucleotide polymorphisms. Consider high-dimensional gene expression profiles in which binary disease status is the outcome variable of interest. The goal is to study how the gene expression profiles are associated with the disease status. Standard statistical approaches cannot be applied directly to such high-dimensional data because of the curse of dimensionality. Typically, the dimension of the original data has to be reduced before the subsequent statistical analysis is performed. In recent years, the effect of interactions between genes on the chromosome on phenotype or disease has been a topic of active interest in statistical genetics. Many genetic studies have shown that complex diseases are caused not by a single dominant gene alone but also by the combined effect of interactions among more than one gene. In this study, our aim is to detect the gene interactions that are associated with complex disease. For the analysis of high-dimensional data, the first step is usually to use PCA to reduce the dimension and then to select the principal components as the main effects in the model. We propose an effective selection strategy for the potential interactions that follows this first step. Specifically, in the second step we use a one-parameter score test to test the interactions one by one. The final step is to perform LASSO on the main effects and interactions selected in the first and second steps to obtain the final model. Our limited simulation studies showed that the proposed strategy, which uses the one-parameter score test to select interactions, can reduce the computation time of LASSO and raise the rate at which the true variables are correctly selected into the model.
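The second step of the strategy, screening pairwise interactions one at a time with a one-parameter score test, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three simulated columns stand in for principal-component scores from the first step, the outcome is binary with a logistic model, the 0.05 significance level is an assumed threshold, and all function and variable names are hypothetical. Only the Python standard library is used so that the sketch stays self-contained. The PCA and LASSO steps are omitted.

```python
import itertools
import math
import random

CHI2_1DF_CRIT_05 = 3.841  # 95th percentile of chi-square with 1 df (assumed 0.05 level)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(t):
    t = max(-30.0, min(30.0, t))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))


def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def weighted_gram(X, w):
    """Return X' diag(w) X as a list of lists."""
    p = len(X[0])
    return [[sum(wi * r[j] * r[k] for wi, r in zip(w, X)) for k in range(p)]
            for j in range(p)]


def fit_logistic(X, y, n_iter=25):
    """Fit the null (main-effects) logistic model by Newton-Raphson."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        mu = [sigmoid(dot(beta, r)) for r in X]
        grad = [sum(r[j] * (yi - mi) for r, yi, mi in zip(X, y, mu))
                for j in range(len(beta))]
        hess = weighted_gram(X, [m * (1 - m) for m in mu])
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta


def score_statistics(X, y, candidates):
    """Score statistic for each candidate column, tested alone against null design X."""
    beta = fit_logistic(X, y)
    mu = [sigmoid(dot(beta, r)) for r in X]
    w = [m * (1 - m) for m in mu]
    i_xx = weighted_gram(X, w)
    stats = {}
    for name, z in candidates.items():
        u = sum(zi * (yi - mi) for zi, yi, mi in zip(z, y, mu))  # score for z
        i_zz = sum(wi * zi * zi for wi, zi in zip(w, z))
        i_zx = [sum(wi * zi * r[j] for wi, zi, r in zip(w, z, X))
                for j in range(len(beta))]
        var = i_zz - dot(i_zx, solve(i_xx, i_zx))  # information adjusted for main effects
        stats[name] = u * u / var
    return stats


# Simulated stand-ins for three principal-component scores (assumption).
rng = random.Random(0)
n = 1000
comps = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
# True model: two main effects plus one interaction between components 0 and 2.
y = [1 if rng.random() < sigmoid(0.5 * c[0] - 0.5 * c[1] + 1.5 * c[0] * c[2]) else 0
     for c in comps]

X = [[1.0] + c for c in comps]  # intercept plus main effects
candidates = {(j, k): [c[j] * c[k] for c in comps]
              for j, k in itertools.combinations(range(3), 2)}
stats = score_statistics(X, y, candidates)
selected = [pair for pair, t in stats.items() if t > CHI2_1DF_CRIT_05]
print(stats)
print(selected)
```

Because the null model is fitted only once and each interaction needs just one small linear solve, the screen is cheap relative to fitting a penalized model on all pairwise interactions; the interactions in `selected`, together with the main effects, would then be passed to the LASSO step.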

Parallel Keywords

High-dimensional data; Interaction effect; Least absolute shrinkage and selection operator; Principal component analysis; Score test

