Exploring Individual Variation in Transcription Factor Binding Sites Using 1000 Genomes Project Data

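Gene regulation is an essential mechanism for maintaining cellular function, and how biological systems regulate gene expression has long been a major research topic. Gene regulation operates at many levels, including control of gene expression, mRNA transcription and splicing, and post-translational modification of proteins. This thesis focuses on how interactions between transcription factors and double-stranded DNA activate or repress gene expression. Of the three billion bases in the human genome, the segments currently known to be biologically meaningful, such as genes and transcription factor binding sites (TFBSs), account for only a small fraction of the DNA. How a transcription factor recognizes its binding sites and thereby regulates genes is therefore an important research question. The DNA segment bound by a transcription factor is roughly 5 to 15 nucleotides long, and the binding strength between a transcription factor and its binding site may also affect the expression of the target gene. The Human Genome Project was launched in the 1990s. Constrained by the technology of the time, it required substantial funding and labor before a draft sequence of the 23 pairs of human chromosomes, about three billion bases in total, was completed in 2001, a major milestone in human genomics. As biotechnology advanced and computing costs fell, sequencing technology progressed rapidly. The 1000 Genomes Project was launched in 2008 with the plan of sequencing more than one thousand genomes within three years using faster and more convenient sequencing technology. Sequencing results for 1,092 genomes were published in 2012, and the project's latest public data set now contains genome sequencing data for 2,504 individuals. The completion of the human genome draft allowed researchers to begin high-throughput screening for transcription factor binding sites, and the growing amount of personal genome data provides rich material for examining individual differences in those sites. The aim of this thesis is to use 1000 Genomes Project data to explore individual variation in transcription factor binding sites and its potential for future genetic testing. The thesis collects binding-site data for 34 human transcription factors from the public JASPAR database and combines them with sequence variant data from the 1000 Genomes Project. The analysis shows that only about 3% of the base positions in JASPAR-annotated binding sites exhibit individual variation, and that the varying positions do not fully agree with the established binding motifs: some variants occur at positions where the motif implies that variation is not tolerated. To explain this inconsistency, the thesis uses the online tool PiDNA, which predicts binding motifs from protein-DNA complexes, to explore minor forms of binding-site sequences that may have been overlooked. Finally, the thesis discusses how existing bioinformatics tools and public databases can be used to assess the importance of sequence variants in transcription factor binding sites in future personal genetic diagnosis, in the hope of providing a reference for applications in personalized medicine.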
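The overlap analysis summarized above (what fraction of binding-site base positions carry a variant) can be sketched as follows. This is only an illustration under stated assumptions: binding sites are given as 0-based half-open intervals, variants as single positions, and the function name and toy coordinates are hypothetical rather than taken from the thesis.

```python
from collections import defaultdict

def fraction_variable_positions(sites, variants):
    """Return the fraction of binding-site base positions that carry a variant.

    sites:    iterable of (chrom, start, end), 0-based half-open intervals
    variants: iterable of (chrom, pos), 0-based positions
    (Both input layouts are assumptions made for this sketch.)
    """
    # Index variant positions by chromosome for fast lookup.
    variant_positions = defaultdict(set)
    for chrom, pos in variants:
        variant_positions[chrom].add(pos)

    # Collect distinct site positions so overlapping sites are not double counted.
    site_positions = set()
    for chrom, start, end in sites:
        for pos in range(start, end):
            site_positions.add((chrom, pos))

    if not site_positions:
        return 0.0

    variable = sum(
        1 for chrom, pos in site_positions
        if pos in variant_positions.get(chrom, ())
    )
    return variable / len(site_positions)

# Toy data (hypothetical coordinates, for illustration only).
sites = [("chr1", 100, 110), ("chr1", 200, 210)]
variants = [("chr1", 103), ("chr1", 150), ("chr2", 105)]
print(fraction_variable_positions(sites, variants))
```

In the toy data only one of the twenty site positions carries a variant. A real analysis would read site coordinates and variant calls from files, which this sketch deliberately leaves out.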

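Keywords

transcription factor; transcription factor binding site; transcription factor binding motif; individual variation; 1000 Genomes Project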

Parallel Abstract

Gene regulation is essential for maintaining cellular functions, so how biological systems regulate gene expression is an important research topic. Gene regulation operates at many levels, including control of gene expression, mRNA transcription and splicing, and post-translational modification. This study explores how interactions between transcription factors and double-stranded DNA activate or repress gene expression. Among the three billion base pairs of the human genome, biologically significant segments such as genes and transcription factor binding sites account for only a small portion of the DNA. Transcription factor binding sites are about 5 to 15 nucleotides long. How transcription factors recognize their binding sites and thereby regulate genes is therefore an important research question. The binding strength between a transcription factor and its binding site may also affect the expression of the target gene. The Human Genome Project was launched in the 1990s. Limited by the technology of the time, the project consumed a great deal of money and manpower before a draft sequence of the 23 pairs of human chromosomes, about three billion bases in total, was completed in 2001. This was a major milestone in human genome research. With the development of biotechnology and the falling cost of computation, genome sequencing technology advanced rapidly. The 1000 Genomes Project started in 2008, planning to sequence more than a thousand human genomes within three years using faster and easier sequencing technology. In 2012, sequencing results for 1,092 human genomes were published, and the project's latest data set contains genome data for 2,504 individuals. The completion of the human genome draft allowed researchers to perform high-throughput screening of transcription factor binding sites.
The growing number of individual genome data sets provides a wealth of research material for examining individual differences in transcription factor binding sites. The objective of this study is to use data from the 1000 Genomes Project to explore individual variation in transcription factor binding sites and its potential application to genetic testing. This study collected binding-site data for 34 human transcription factors from the JASPAR database and combined them with variant data from the 1000 Genomes Project. The analysis shows that only about 3% of the positions in JASPAR-annotated transcription factor binding sites exhibit individual variation. Furthermore, the positions with individual variation are not fully consistent with the original binding motifs: some variants occur at positions where the corresponding motif implies that variation is not allowed. To investigate the rationale behind this inconsistency, this study used an online tool named PiDNA, which predicts the binding motif of a DNA-binding protein from protein-DNA complex structures, and employed the predicted motifs to explore potential minor forms that might previously have been overlooked. Finally, this study discusses how existing bioinformatics tools and public databases can be used in future personal genetic diagnosis to assess the importance of variants observed in transcription factor binding sites. It is expected that this study can provide novel insights for individual genetic testing in personalized medicine.
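One way to check whether a variant falls on a motif position that "implies not allowing variations" is to compute the per-position information content of the motif and flag variants at highly conserved positions. The sketch below is a hedged illustration, not the procedure used in the study: it assumes a position frequency matrix given as a list of per-position base counts, a uniform background, and an arbitrary threshold of 1.5 bits. All names and the toy matrix are hypothetical.

```python
import math

def information_content(column):
    """Information content in bits of one motif column.

    column: dict mapping bases to counts, e.g. {"A": 5, "C": 5}.
    Assumes a uniform background over A, C, G, T (maximum 2 bits).
    """
    total = sum(column.get(base, 0) for base in "ACGT")
    if total == 0:
        return 0.0
    ic = 2.0
    for base in "ACGT":
        p = column.get(base, 0) / total
        if p > 0:
            ic += p * math.log2(p)
    return ic

def conserved_positions_hit(pfm, variant_offsets, threshold=1.5):
    """Return the variant offsets that land on highly conserved motif positions.

    pfm:             list of per-position count dicts (hypothetical layout)
    variant_offsets: 0-based offsets of observed variants within the site
    threshold:       bits above which a position counts as conserved
                     (1.5 is an arbitrary choice for this sketch)
    """
    return [
        i for i in sorted(set(variant_offsets))
        if 0 <= i < len(pfm) and information_content(pfm[i]) >= threshold
    ]

# Toy motif: position 0 is invariant, position 1 tolerates two bases,
# position 2 is nearly uniform (values are illustrative only).
pfm = [
    {"A": 10},
    {"A": 5, "C": 5},
    {"A": 3, "C": 2, "G": 3, "T": 2},
]
print(conserved_positions_hit(pfm, [0, 2]))  # prints [0]
```

Variants flagged this way are the ones that conflict with the motif description, which makes them natural candidates for the minor-form analysis described above.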

Parallel Keywords

transcription factor; transcription factor binding site; transcription factor binding motif; 1000 Genomes Project

