  • Degree thesis

Effect of de novo transcriptome assembly on quality of read mapping and transcript quantification

Advisor: 歐陽彥正
Co-advisor: 陳倩瑜 (Chien-Yu Chen)

Abstract


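RNA sequencing reveals how the transcriptome is expressed at different growth stages or under different physiological conditions, and thereby sheds light on gene regulatory pathways within an organism. Moreover, because RNA sequencing does not require a reference genome or transcriptome in advance, it is particularly well suited to species whose genomes are not yet thoroughly annotated or that have never been studied. Without a reference sequence, researchers must assemble and reconstruct transcript sequences from the short sequenced RNA fragments. However, redundant or erroneous sequences produced during assembly are likely to seriously affect downstream quantification. Estimating transcript abundance correctly by computational means is therefore an important problem. This thesis evaluates how the quality of transcriptome assembly affects algorithms that quantify transcript abundance. The assembled sequences are classified into twelve categories with distinct meanings, and quantification is analyzed and compared for each category. The results show that, although the transcripts in an organism share a great deal of similarity, this similarity has little effect on quantification against the reference genome or transcriptome. It does, however, cause assembly errors, which in turn lead to larger quantification errors for the assembled sequences. Assembly errors that merge several similar sequences into a single sequence produce the most severe results. In addition, this thesis proposes a supervised learning algorithm for predicting assembly errors, which can help future researchers better understand the assembled sequences they analyze. In summary, by comparing multiple assembly and quantification algorithms, this study gives researchers a better understanding of transcriptome assembly and quantification for species without reference sequences.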

Parallel Abstract


Correct quantification of transcript abundance is essential for understanding the functional products of the genome under different physiological conditions and at different developmental stages. The development of high-throughput RNA sequencing (RNA-Seq) has allowed researchers to perform transcriptome analysis on organisms that lack a reference genome or transcriptome. In such projects, de novo transcriptome assembly must be carried out before quantification. However, the large number of fragmented contigs and redundant sequences produced by assemblers may result in unreliable abundance estimation. This study therefore first investigates how assembly quality affects the quality of read mapping and count estimation, and then proposes a classifier to characterize the assembled sequences. Through the experiments and analyses conducted in this study, several important factors that might seriously affect the accuracy of RNA-Seq analysis are discussed. First, the effects on quantification quality of twelve distinct assembly groups, along with the intrinsic similarity present in the reference transcriptome, were examined. The results showed that similar subsequences present in the reference transcriptome only slightly influence mapping quality, but lead to many poorly assembled contigs. Contigs that merge multiple transcripts into one decreased the reliability of abundance estimation most heavily. Second, a prediction algorithm was proposed to help researchers estimate quantification reliability for further analyses. In summary, the analyses conducted in this study provide valuable insights for future studies related to RNA-Seq data analysis.

