基於狀態空間模型尋找顯著表現基因之研究

Identification of differential expressed genes based on state space model

Advisor: 陳中明

Abstract


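The emergence of microarray technology has given life science researchers a powerful tool: it can analyze the expression levels of tens of thousands of genes, and the interactions among them, at the same time. An important problem in microarray data analysis is how to find genes whose expression differs significantly between conditions. None of the statistical methods published so far can identify the significantly expressed genes with complete accuracy. Most of them are based on the two sample t-test, which performs the test on the values observed in the experiments. Because microarray experiments have too few replicates and the true distribution of the data cannot be determined, the estimates produced by such methods are questionable. This thesis proposes a method for finding significantly expressed genes that is based on a linear dynamic system, a kind of state space model. With only a small number of samples, it applies an optimizing iterative algorithm to the observed values and the system noise to recover the hidden actual values. These are used to correct the mean and variance in the t-test so that they lie closer to the true state of gene expression, which improves the performance of the t-test. We test the proposed method on simulated data and on Down syndrome data, and compare it with other published methods: the two sample t-test, significance analysis of microarrays (SAM), the Bayesian probabilistic framework (Bayes), and the local-pooled-error test (LPE). On the simulated data, the proposed method performs clearly better than the other methods where gene expression is low. On the real Down syndrome data, it finds genes significantly related to Down syndrome that the other methods fail to find. The results on both data sets indicate that the proposed method performs better than the other methods.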
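As a rough illustration of why a t-test on very few replicates can be unreliable, the sketch below computes a two sample t statistic once from raw sample moments and once from corrected moments. It is not the thesis's implementation: the Welch (unequal variance) form, the function names `t_from_moments` and `welch_t`, the toy expression values, and the corrected variance of 0.04 are all assumptions chosen for the example.

```python
import math
from statistics import mean, variance


def t_from_moments(m1, v1, n1, m2, v2, n2):
    """Two sample t statistic (Welch form) from means, variances and sizes."""
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)


def welch_t(x, y):
    """Welch t statistic computed directly from the observed replicates."""
    return t_from_moments(mean(x), variance(x), len(x),
                          mean(y), variance(y), len(y))


# Hypothetical log-expression values: three replicates per condition.
control = [1.00, 1.01, 0.99]
treated = [1.10, 1.11, 1.09]

# With three replicates the sample variance can be tiny by chance,
# which inflates the statistic even though the mean shift is small.
t_raw = welch_t(control, treated)

# If a model supplied a larger, more realistic variance (0.04 here is an
# assumed value), the same mean difference gives a much smaller statistic.
t_corrected = t_from_moments(mean(control), 0.04, 3, mean(treated), 0.04, 3)

print(round(t_raw, 2), round(t_corrected, 2))
```

The only point of the comparison is that the statistic depends strongly on the variance estimate when the number of replicates is small, which is the quantity the proposed method aims to correct.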

Parallel Abstract


The emergence of microarray technology provides a powerful tool for researchers in the life sciences. It can be used to analyze the expression profiles and gene-gene interactions of tens of thousands of genes at the same time. In microarray data analysis, one of the important topics is finding genes that are differentially expressed under different conditions. Important as it is, none of the existing statistical methods can identify all the differentially expressed genes with complete accuracy. Many of these methods are based on the two sample t-test, which uses observed values from microarray experiments to perform the statistical test. With a limited number of repeated microarray experiments, the actual distributions of the gene expressions are hard to estimate. As a result, the results derived by these t-test based methods are questionable. A new method based on the state space model is proposed in this thesis for identifying differentially expressed genes from microarray data. The salient feature of the proposed method is its ability to identify differentially expressed genes with only a few replicates. The basic idea is to use the observed microarray data to estimate the unobservable parameters of the actual expressions under a state space model, which is solved by Kalman filtering and the EM algorithm. With the estimated model parameters, i.e., the means and variances of the actual gene expressions, the accuracy of the t-test is shown to be greatly enhanced. The performance of the proposed method has been compared with several previous approaches using simulated data and Down syndrome data. These methods include the two sample t-test, significance analysis of microarrays (SAM), the Bayesian probabilistic framework (Bayes), and the local-pooled-error test (LPE). The results show that, for the simulated data, the proposed method is particularly superior to the other tested methods when gene expression is low. For the Down syndrome data, the proposed method succeeded in identifying several differentially expressed genes that are known to be related to Down syndrome but could not be identified by the other methods. Both results support the conclusion that the proposed method outperforms the other tested methods.
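The abstract does not spell out the model equations, so the following is only a minimal sketch of the general technique it names: Kalman filtering combined with the EM algorithm on a linear state space model. It assumes a scalar random-walk state, x_t = x_{t-1} + w_t with y_t = x_t + v_t, and a fixed prior on the first state; the function names `kalman_smooth` and `em_fit`, the starting values, and the toy data are all hypothetical.

```python
def kalman_smooth(y, q, r, mu0, p0):
    """Kalman filter and RTS smoother for x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    q is the state noise variance and r the observation noise variance.
    """
    n = len(y)
    xp, pp = [0.0] * n, [0.0] * n   # predicted mean and variance
    xf, pf = [0.0] * n, [0.0] * n   # filtered mean and variance
    for t in range(n):
        xp[t] = mu0 if t == 0 else xf[t - 1]
        pp[t] = p0 if t == 0 else pf[t - 1] + q
        k = pp[t] / (pp[t] + r)                 # Kalman gain
        xf[t] = xp[t] + k * (y[t] - xp[t])
        pf[t] = (1.0 - k) * pp[t]
    xs, ps, gain = xf[:], pf[:], [0.0] * n      # smoothed mean and variance
    for t in range(n - 2, -1, -1):
        gain[t] = pf[t] / pp[t + 1]             # smoother gain
        xs[t] = xf[t] + gain[t] * (xs[t + 1] - xp[t + 1])
        ps[t] = pf[t] + gain[t] ** 2 * (ps[t + 1] - pp[t + 1])
    return xs, ps, gain


def em_fit(y, q=1.0, r=1.0, n_iter=50):
    """Estimate q and r by EM; the prior (mu0, p0) is held fixed (assumption)."""
    n = len(y)
    mu0, p0 = y[0], 1.0
    for _ in range(n_iter):
        # E-step: smoothed moments of the hidden states.
        xs, ps, gain = kalman_smooth(y, q, r, mu0, p0)
        # M-step: expected squared residuals under the smoothed posterior.
        r = sum((y[t] - xs[t]) ** 2 + ps[t] for t in range(n)) / n
        q = sum((xs[t] - xs[t - 1]) ** 2 + ps[t] + ps[t - 1]
                - 2.0 * gain[t - 1] * ps[t]     # lag-one covariance term
                for t in range(1, n)) / (n - 1)
        q = max(q, 1e-12)                       # guard against round-off
    xs, ps, _ = kalman_smooth(y, q, r, mu0, p0)
    return xs, ps, q, r


# Hypothetical observed expression values for one gene in one condition.
observed = [7.9, 8.4, 8.1, 8.6, 8.2]
hidden, hidden_var, q_hat, r_hat = em_fit(observed)
print([round(v, 3) for v in hidden], q_hat, r_hat)
```

In this sketch the smoothed states stand in for the "actual" expression values, and their mean and variance are the kind of quantities that could be passed to a t-test in place of the raw sample moments. How the thesis maps replicates and conditions onto the state sequence is not stated in the abstract.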

