
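Construction of a clinical microarray meta-analysis database and a study of inter- and intra-individual differences in gene expression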

The database construction for microarray clinical meta-analysis and inter/intra individual variance gene expression study

Advisor: 許志楧

Abstract


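Over the past decade, microarrays have been used increasingly in clinical research, with broad application in cancer classification, the search for disease marker genes, and even the prediction of treatment outcome. However, the complexity of clinical specimens often makes results controversial, mainly because the biological variance introduced by clinical specimens is large and hard to estimate; examples include the cellular homogeneity of tissue specimens and genetic differences between individuals and between populations. In this study we therefore examine how the estimation of biological variance affects research results. Using placental specimens as a model, we designed experiments that divide the variance commonly encountered in clinical work into three levels: technical variance, anatomical variance, and individual variance. Taking the search for “inter-individual variable genes” as an example, we also examine, from statistical and biological viewpoints, what happens when variance is underestimated. We found that when variance is underestimated, genes that are not statistically significant are inevitably judged significant, and that this interferes with the prediction of biological function. We further found that using a somewhat higher expression difference as the selection criterion helps reduce or even eliminate the effects of underestimated variance, at the cost of fewer significant genes. This part of the study demonstrates the importance of estimating variance correctly. A higher expression-difference threshold may be one way to mitigate the effects when an experimental design cannot estimate biological variance correctly.

An important requirement for estimating biological variance correctly is to run experiments on a large number of clinical specimens, but the funding this requires is beyond what most studies can afford. Fortunately, the MGED Society's “Minimum Information About a Microarray Experiment” guideline has been widely promoted in recent years, and a large amount of usable experimental data can now be obtained from online databases, so meta-analysis that integrates data from different experiments is receiving growing attention in microarray research. However, finding the specific data needed in public microarray databases remains difficult in practice, and differing analysis methods and low-quality data introduce further variance into the analysis and reduce the reliability of meta-analysis results. To obtain a large body of reliable clinical data for meta-analysis, we built M2DB, a clinical microarray meta-analysis database that collects clinical experimental data from 10,202 Affymetrix arrays and provides a friendly interface for filtering by the clinical information of the specimens. All data are preprocessed uniformly, and quality-control parameters can be set to remove low-quality experiments; unprocessed data are also provided so that users can run their own analyses. The online query interface filters data by flexible combinations of five clinical annotations, which cover physiological state and tissue of origin. These annotations are based on information from the GEO database, the ArrayExpress database, and the associated research papers, and were assigned manually using accepted vocabulary. The data-processing and quality-control algorithms all come from published studies. M2DB offers a lower entry barrier and a unified processing pipeline for meta-analysis. We hope this database will promote the development of microarray meta-analysis.

We then used M2DB to carry out a microarray meta-analysis to find reference genes suitable for quantitative polymerase chain reaction (PCR) in clinical research. Quantitative PCR is the gold standard for quantifying gene expression, and expression levels obtained with other experimental techniques are ultimately judged against it; its accuracy, however, depends heavily on the “reference genes” used for data normalization. Housekeeping genes have long been treated as reference genes, but a growing number of studies in recent years show that housekeeping gene expression is often affected by disease, drugs, and other factors, which makes quantitative PCR results inaccurate. With M2DB, after uniform data processing and data quality control, 4,804 clinical specimens belonging to 13 organ/tissue categories and 4 physical states could be included in this study. We calculated the variability of each gene's expression across physiological states and across organs, and took the most stably expressed genes in each organ as its reference genes. A total of 102 genes were identified as reference genes in multiple organ/tissue categories. On further analysis, these genes had often been regarded as housekeeping genes in earlier studies, and nearly 71% of them fall under Gene Expression (GO:0010467) in Gene Ontology. Based on our results, researchers can select one or more reference genes as the basis for normalizing quantitative PCR.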
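The idea of picking the most stably expressed genes per organ can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which variability measure was used, so the coefficient of variation is a stand-in, and the gene names, expression values, and function names are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Return the population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def rank_reference_candidates(expression, top_n=2):
    """Return the top_n genes with the lowest CV, most stable first.

    expression: dict mapping gene name -> list of expression values,
    one value per physiological state (hypothetical layout).
    """
    cvs = {gene: coefficient_of_variation(vals) for gene, vals in expression.items()}
    return sorted(cvs, key=cvs.get)[:top_n]


# Hypothetical expression values for one organ across four physiological states.
liver = {
    "GENE_A": [100.0, 102.0, 98.0, 100.0],
    "GENE_B": [50.0, 120.0, 80.0, 30.0],
    "GENE_C": [200.0, 210.0, 190.0, 200.0],
}
candidates = rank_reference_candidates(liver)
```

Here `GENE_B` varies strongly between states and drops out, while `GENE_A` and `GENE_C` remain as candidates. Repeating the ranking per organ and intersecting the results would give genes that are stable in multiple organ/tissue types.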

Parallel Abstract


Over the last decade, microarray studies have had a profound impact on clinical research, including cancer classification, the search for disease biomarkers, and prognosis prediction. However, the complexity of clinical samples can lead to inconsistent results, because the biological variance introduced by clinical samples is hard to estimate; sources include the heterogeneity of cells in clinical samples and genetic variance between individuals and between populations. In this research, we investigated how the estimation of biological variance influences the conclusions of a study. We broke intra- and inter-individual variance in clinical studies down into three levels (technical, anatomic, and individual) and designed experiments and algorithms to investigate each. As a case study, a group of “inter-individual variable genes” was identified to exemplify the influence of underestimated variance on the statistical and biological aspects of identifying differentially expressed genes. Our results showed that inadequate estimation of variance inevitably led to the inclusion of non-statistically significant genes among those listed as significant, thereby interfering with the correct prediction of biological functions. Our data demonstrate that an appropriate evaluation of variance is critical in selecting significantly differentially expressed genes.

Estimating biological variance precisely requires a large number of clinical experiments, which is often too expensive. Fortunately, with the rapid development of microarray technology, a large amount of data is now available from public repositories. Meta-analysis of this accumulated data, which integrates valuable information from multiple studies, is becoming more important in microarray research. However, collecting data of special interest from public microarray repositories often presents major practical problems. Moreover, including low-quality data may significantly reduce meta-analysis efficiency. To obtain a large body of reliable clinical microarray data, we constructed a microarray meta-analysis database (M2DB) for clinical studies. It is a human-curated microarray database designed for easy querying based on clinical information and for interactive retrieval of either raw or uniformly pre-processed data, along with a set of quality-control metrics. The database contains more than 10,000 previously published Affymetrix GeneChip arrays performed on human clinical specimens. M2DB allows online querying according to a flexible combination of five clinical annotations describing disease state and sampling location. We hope that this research will promote the further development of microarray meta-analysis.

We then used M2DB to perform a meta-analysis to identify reference genes for quantitative RT-PCR in clinical studies. The accuracy of quantitative real-time PCR (qRT-PCR) is highly dependent on reliable reference gene(s). Some housekeeping genes that are commonly used for normalization are widely recognized as inappropriate in many experimental conditions. After uniform data preprocessing and data quality control, 4,804 Affymetrix HU-133A arrays performed on clinical samples were classified into four physiological states and 13 organ/tissue types. For each organ/tissue type, we identified a list of reference genes that exhibited stable expression across physiological states. Furthermore, 102 genes identified as reference gene candidates in multiple organ/tissue types were selected for further analysis. According to our results, researchers can select single or multiple reference gene(s) for normalization of qRT-PCR in clinical studies.
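One way underestimated variance can inflate significance is sketched below. This is not the thesis's algorithm: it assumes a simple two-group comparison of one gene with Welch's t statistic, and all expression values are hypothetical. Counting technical replicates as independent samples captures mostly technical variance and shrinks the standard error, whereas averaging replicates per individual keeps the inter-individual variance in play.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical log2 expression of one gene: 3 individuals per group,
# 3 technical replicates per individual.
group1 = [[8.0, 8.1, 7.9], [9.0, 9.1, 8.9], [8.5, 8.6, 8.4]]
group2 = [[9.0, 9.1, 8.9], [9.5, 9.6, 9.4], [10.0, 10.1, 9.9]]

# Naive: every replicate is treated as an independent sample (n = 9 per group).
flat1 = [x for individual in group1 for x in individual]
flat2 = [x for individual in group2 for x in individual]
t_naive = welch_t(flat1, flat2)

# Per individual: replicates are averaged first (n = 3 per group).
means1 = [mean(individual) for individual in group1]
means2 = [mean(individual) for individual in group2]
t_individual = welch_t(means1, means2)
```

With these numbers the naive statistic is roughly twice as large in magnitude as the per-individual one, so the same gene looks far more significant when the replicates are miscounted as independent samples.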

