  • Degree thesis

A Study of Virus Classification via Genomic DNA Sequences

Advisor: 王經篤

Abstract

The availability of virus genome sequences now offers a new approach to virus classification from the viewpoint of molecular biology, as an alternative to traditional morphology. To use existing classifiers in the vector space model, each virus must first be transformed into a representative vector. In this study, we adopted k-mer extraction for patterns and used the entropy of each pattern's distribution as its weight, converting virus instances (genomic sequences) into vectors that serve as input to the classification experiments. To compare the effectiveness of coding and non-coding regions within a DNA nucleotide sequence, four types of sequences, "ALL", "Coding", "NonCoding", and "DirectedCoding", were extracted separately as input for classification. The viral genomes were downloaded from NCBI and comprised 22 virus families containing 1,601 virus species. Values of k from 1 to 6 were evaluated. The results showed that the highest accuracy, 95.6%, was achieved by the SVM classifier using sequences of type "ALL" with k = 5. Furthermore, "DirectedCoding" yielded higher accuracy than "Coding". Unexpectedly, the "NonCoding" sequences reached an accuracy as high as 90% at k = 6. This observation suggests that non-coding regions still retain some information, which merits further investigation by biologists.
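
The abstract does not give the exact weighting formula, so the following is only a minimal sketch of how k-mer extraction with entropy-based weighting might look. It assumes a weight of 1 - H/log(C), where H is the entropy of a k-mer's distribution across C families; this formula, the function names, and the toy sequences are illustrative assumptions and not the thesis's actual method or data.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def kmer_counts(seq, k):
    """Count overlapping k-mers; windows containing non-ACGT symbols are skipped."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts

def entropy_weights(class_counts, k):
    """Assumed weighting: 1 - H/log(C), where H is the entropy of a k-mer's
    distribution over the C classes. A k-mer seen in only one class gets
    weight 1; an evenly spread or unseen k-mer gets weight 0."""
    classes = list(class_counts)
    if len(classes) < 2:
        raise ValueError("need at least two classes")
    max_h = math.log(len(classes))
    weights = {}
    for kmer in ("".join(p) for p in product(ALPHABET, repeat=k)):
        freqs = [class_counts[c][kmer] for c in classes]
        total = sum(freqs)
        if total == 0:
            weights[kmer] = 0.0
            continue
        h = -sum((f / total) * math.log(f / total) for f in freqs if f > 0)
        weights[kmer] = 1.0 - h / max_h
    return weights

def to_vector(seq, k, weights):
    """Map a sequence to a fixed-length vector of weighted relative k-mer frequencies."""
    counts = kmer_counts(seq, k)
    total = sum(counts.values()) or 1
    return [weights[kmer] * counts[kmer] / total for kmer in sorted(weights)]

# Toy data (hypothetical family labels and sequences, not real genomes).
k = 2
train = {"famA": ["AAAAAA", "AAAAAT"], "famB": ["CCCCCC", "CCCCCG"]}
class_counts = {
    fam: sum((kmer_counts(s, k) for s in seqs), Counter())
    for fam, seqs in train.items()
}
weights = entropy_weights(class_counts, k)
vec = to_vector("AAAACC", k, weights)  # 4**k = 16 dimensions
```

Vectors built this way have 4^k dimensions (4,096 at k = 6) and could then be passed to any vector-space classifier, for example an SVM such as scikit-learn's `sklearn.svm.SVC`.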

