  • Theses and dissertations

發展一運用RNA定序資料鑑定病原體之演算法

Development of a Fast Algorithm for Pathogen Identification through RNA-seq

Advisor: 莊曜宇

Abstract


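Early diagnosis of infectious diseases caused by pathogens such as viruses, bacteria, or fungi is a major topic in current clinical research. Beyond traditional methods for identifying bacterial strains and viruses, advances in next-generation sequencing have made sequencing-based pathogen detection an effective approach. Research groups worldwide have developed several methods for this task. However, these algorithms demand large amounts of computation time and computing resources, which hinders their practical use. We therefore developed a novel, high-performance algorithm for pathogen identification.

The algorithm takes RNA-seq data and identifies pathogen sequence fragments in four steps. First, the reads are aligned to the human reference genome, and the non-human reads are kept for the next step. Second, the non-human reads are assembled de novo: they are joined through sequence overlap and extension into longer contigs. Third, a statistical model assesses the accuracy of the assembly. Finally, the contigs that pass the statistical test are searched with BLAST to determine their species of origin.

We evaluated the algorithm on simulated data and on experimental RNA-seq data. On both, it showed high accuracy and sensitivity. In a comparison with three other algorithms, ours showed higher computational efficiency. We applied the method to cervical cancer, lung adenocarcinoma, and colorectal cancer data sets to look for pathogens possibly associated with these cancers, and the analysis found candidate pathogens for each cancer type.

In summary, the algorithm detects candidate pathogens from RNA-seq data accurately and efficiently. Its computational efficiency is good and it shortens working time, so we expect it to aid research on pathogen detection.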
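The first two steps described in the abstract (discarding host reads, then joining the remaining reads by overlap) can be illustrated with a toy Python sketch. It is not the thesis implementation. The function names, the exact k-mer lookup used in place of a real read aligner, and the greedy merge strategy are all assumptions made to keep the example self-contained. The statistical test and the BLAST search of steps three and four are left out because the abstract does not specify them.

```python
from typing import Iterable, List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def filter_non_host(reads: Iterable[str], host_reference: str, k: int = 8) -> List[str]:
    """Toy stand-in for step 1: keep reads that share no k-mer with the host.

    Assumption: a real pipeline would run a read aligner against the human
    reference; exact k-mer lookup is used here only for illustration.
    """
    host_kmers = kmers(host_reference, k)
    return [r for r in reads if not (kmers(r, k) & host_kmers)]


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: Iterable[str], min_overlap: int = 4) -> List[str]:
    """Toy stand-in for step 2: repeatedly merge the pair with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep input order
    while True:
        best = (0, -1, -1)  # (overlap length, index of left read, index of right read)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # no pair overlaps by at least min_overlap: assembly is done
            return contigs
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)] + [merged]
```

Calling `greedy_assemble(filter_non_host(reads, host_reference))` chains the two steps. The all-pairs comparison is quadratic in the number of reads, so this form only suits small illustrative inputs.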

Keywords

next-generation sequencing; RNA-seq; pathogen

Parallel Abstract


Early diagnosis of infections caused by viruses, bacteria, or fungi is an important issue in clinical research. In addition to traditional, labor-intensive in vitro identification of strains and viruses, in silico methods for pathogen identification have emerged with the advance of next-generation sequencing. Research groups around the world have developed several such methods. However, these methods remain time-consuming and compute-intensive, which creates practical obstacles. To address these issues, we developed an accurate and efficient algorithm for pathogen identification. The algorithm identifies pathogens from RNA-seq data in four steps. First, the sequencing reads are aligned to the human reference genome, and the reads that cannot be aligned are retained for subsequent analysis. Second, the retained reads are assembled into pathogen contigs through their overlapping regions. Third, a statistical model is applied to the putative transcript contigs to remove spurious contigs that result from random assembly. Finally, BLAST is applied to the contigs that pass the statistical test to identify the species and strains of the pathogens. To evaluate performance, we used both simulated and real data sets that contain samples with pathogen infections. On both, the algorithm showed high sensitivity and accuracy. Compared with three other methods, our algorithm showed higher computational efficiency. We also applied the method to cervical cancer, lung adenocarcinoma, and colorectal cancer data sets to identify pathogens possibly associated with these three cancers. In summary, our method is accurate and efficient in detecting pathogens using RNA-seq data from patient samples. Moreover, its efficiency and short running time enable the use of large data sets in pathogen studies.
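As an illustration of how sensitivity might be scored on a simulated data set where the spiked-in pathogens are known, the following Python sketch compares the set of detected species with the known set. The abstract does not say how the metrics were computed, so this is an assumption: the function name is hypothetical, and precision is reported alongside sensitivity as one plausible reading of "accuracy".

```python
from typing import Dict, Set


def detection_metrics(truth: Set[str], detected: Set[str]) -> Dict[str, float]:
    """Score a detection run against the known pathogens of a simulated sample.

    Assumption: detection is scored per species label, not per read or contig.
    """
    tp = len(truth & detected)   # pathogens present and reported
    fn = len(truth - detected)   # pathogens present but missed
    fp = len(detected - truth)   # pathogens reported but not present
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return {"sensitivity": sensitivity, "precision": precision}
```

For example, `detection_metrics({"A", "B", "C"}, {"A", "C", "D"})` counts two true positives, one miss, and one false report; the labels here are placeholders, not results from the thesis.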

Parallel Keywords

next-generation sequencing; RNA-seq; pathogen

