
Improving completeness of de novo transcriptome assembly and gene annotation by comparison of species within the same genus

Advisor: 陳倩瑜

Abstract


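Advances in next-generation sequencing not only provide high-throughput transcriptome sequence data but also advance research on species that lack a reference genome. With de novo assembly tools, short reads can be assembled into transcript sequences, and these transcripts must then be annotated to infer their functions. For such de novo assembled sequences, the common approach is to use a sequence alignment tool such as BLASTx: similarity to sequences in a protein database suggests the likely function of the translated protein. This study examined the gene annotation obtainable for transcriptome sequences of the oriental fruit fly (Bactrocera dorsalis) and the melon fly (Bactrocera cucurbitae) by aligning them against a model organism database. Only 49% of the B. dorsalis sequences (about 24,800 assembled sequences) could be annotated using Drosophila melanogaster; for B. cucurbitae the figure was only 46% (about 25,400 assembled sequences). This result shows that annotation based on a model organism alone has limits, because the species under study remains evolutionarily distant from the model organism. More annotations can be obtained, however, if the similarity comparison uses not only the most similar model organism but also two more closely related species. B. dorsalis and B. cucurbitae belong to the same genus and therefore share a high degree of similarity and many homologous genes. This study established an analysis method that uses sequence similarity between these congeneric species as links, building connected components of an undirected graph to improve the completeness of gene annotation. In addition, assembly statistics for the two species show that the average assembled sequence length of B. cucurbitae is about twice that of B. dorsalis, which suggests that the B. dorsalis assembly contains more incompletely assembled sequences than the B. cucurbitae assembly. Through the connected component analysis, this study provides a list of assembled sequences that were not assembled into one but should in essence be joined, thereby improving the de novo assembly result.

To ensure that high sequence similarity is reliable in the connected component analysis, this thesis adopted the following alignment criteria: identity above 80%, E-value below 10^-20, and aligned protein length greater than 70 amino acids. These criteria yielded 7,086 connected components. Under the proposed strategy, a sequence that finds a related sequence from the other congeneric species within its connected component gains a potential gene annotation from the annotation of that sequence. Among sequences that could not be annotated by alignment against D. melanogaster alone, 925 B. dorsalis sequences and 272 B. cucurbitae sequences obtained additional gene annotations. For improving the de novo assembly, the connected component analysis suggests that 1,919 B. dorsalis sequences and 71 B. cucurbitae sequences should be joined into longer sequences, becoming 680 and 52 transcripts respectively. Finally, by building a database, this study provides an online analysis platform for the connected component method so that biologists can access the results; researchers can inspect the multiple sequence alignment within each component, which facilitates future experimental design and downstream applications.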
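The component-building step can be illustrated with a short sketch. This is not the thesis's implementation: it assumes the pairwise alignments between the two species are already available as simple records, and the names `Hit`, `passes_thresholds`, `connected_components`, and the sequence IDs are hypothetical. Only the three thresholds come from the abstract.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One pairwise alignment between two assembled sequences (hypothetical record)."""
    query: str
    subject: str
    identity: float   # percent identity, 0-100
    evalue: float
    aln_len_aa: int   # aligned length in amino acids


def passes_thresholds(hit: Hit) -> bool:
    # Thresholds stated in the abstract: identity > 80%, E-value < 1e-20, length > 70 aa.
    return hit.identity > 80.0 and hit.evalue < 1e-20 and hit.aln_len_aa > 70


def connected_components(hits: Iterable[Hit]) -> list[set[str]]:
    """Group sequences linked by reliable hits, treating each hit as an undirected edge."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for hit in hits:
        if passes_thresholds(hit):
            root_q, root_s = find(hit.query), find(hit.subject)
            if root_q != root_s:
                parent[root_q] = root_s

    groups: dict[str, set[str]] = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())


# Made-up example data; the IDs and values are illustrative only.
hits = [
    Hit("Bd_001", "Bc_101", 92.5, 1e-60, 210),
    Hit("Bd_002", "Bc_101", 88.0, 1e-35, 120),
    Hit("Bd_003", "Bc_102", 75.0, 1e-40, 150),  # rejected: identity too low
    Hit("Bd_004", "Bc_103", 95.0, 1e-10, 90),   # rejected: E-value too high
]
components = connected_components(hits)
```

Union-find is used here only because it builds components in a single pass over the hits; a breadth-first search over an adjacency list would give the same grouping.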

Parallel Abstract


Advances in next-generation sequencing technology not only provide high-throughput sequencing data but also considerably facilitate transcriptome studies of species without a reference genome. With de novo assembly, transcripts can be reconstructed from the sequencing reads. To infer the protein function of the assembled sequences, one conventional approach is to search them against a protein database with BLASTx and use sequence similarity. In this study, only 49% (approximately 24,800 sequences) of the assembled Bactrocera dorsalis (B. dorsalis) sequences could be annotated with Drosophila melanogaster (D. melanogaster) genes by BLASTx. For Bactrocera cucurbitae (B. cucurbitae), only 46% (approximately 25,400 sequences) of the assembled transcripts could be annotated with D. melanogaster genes. This reveals an inevitable limitation when the target organism is evolutionarily distant from the model organism. If the similarity comparison is made not only against the most closely related model organism but also draws on a much more closely related organism, the annotation list can be made more complete. B. cucurbitae and B. dorsalis belong to the same genus and share a high level of homology. By finding connected components (CCs), we can use the similarity links between these two species to further improve annotation. In addition, the assembly statistics show that the average length of the B. cucurbitae assembled sequences is about twice that of B. dorsalis, suggesting that the B. dorsalis assembly may contain many more incompletely assembled transcripts than the B. cucurbitae assembly.
Through the CC analysis, we can also improve the de novo assembly result by providing a list of transcripts that should have been joined together. A total of 7,086 CCs were obtained using strict similarity criteria (identity higher than 80%, E-value smaller than 10^-20, and alignment length longer than 70 amino acids). Mutual comparison among sequences from the same genus, Bactrocera, suggested potential annotations for transcripts that cannot be annotated when they are compared only with D. melanogaster sequences. In terms of annotation completeness, 925 B. dorsalis sequences and 272 B. cucurbitae sequences can be additionally annotated with D. melanogaster genes. In terms of de novo assembly, a total of 1,919 B. dorsalis sequences are recommended to be concatenated into 680 longer transcripts, and 71 B. cucurbitae sequences are suggested to be joined into 52 longer transcripts. Finally, a database was constructed to provide a user-friendly platform for the CC analysis and to help biologists retrieve illustrations of the sequence alignment relationships within CCs.
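The annotation-transfer idea can be sketched as follows, assuming the connected components and the direct D. melanogaster annotations are already in hand. The function names, the "Bd_"/"Bc_" ID prefixes, and the gene labels are hypothetical, and the rule shown (borrow from annotated members of the other species in the same component) is one reading of the abstract rather than the thesis's exact procedure.

```python
from __future__ import annotations


def species_of(seq_id: str) -> str:
    # Hypothetical ID convention: "Bd_" = B. dorsalis, "Bc_" = B. cucurbitae.
    return seq_id.split("_", 1)[0]


def transfer_annotations(
    components: list[set[str]],
    annotations: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Suggest genes for unannotated sequences, borrowed from annotated
    members of the other species in the same connected component."""
    suggested: dict[str, set[str]] = {}
    for component in components:
        for seq in component:
            if annotations.get(seq):
                continue  # already annotated directly against the model organism
            borrowed: set[str] = set()
            for other in component:
                if species_of(other) != species_of(seq):
                    borrowed |= annotations.get(other, set())
            if borrowed:
                suggested[seq] = borrowed
    return suggested


# Made-up example data; the IDs and gene labels are illustrative only.
components = [{"Bd_001", "Bd_002", "Bc_101"}, {"Bd_010", "Bc_110"}]
annotations = {"Bc_101": {"gene_A"}, "Bd_010": {"gene_B"}}
suggested = transfer_annotations(components, annotations)
```

Keeping the borrowed annotations in a separate `suggested` mapping, rather than merging them into `annotations`, keeps direct BLASTx evidence distinguishable from annotations inferred through a congeneric sequence.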

