
以分類、分群和關聯探勘法探討蛋白質序列的階層式分類

Exploring Hierarchical Classifications of Protein Sequences with Classification, Clustering and Mining Association Methods

Advisor: 林宣華

Abstract


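Thanks to rapid advances in biotechnology, the number of sequenced proteins is growing quickly, and the sequences are publicly available on the Internet. Among these sequences, however, only a small fraction have had their functions and structures annotated manually; for most known protein sequences the supplementary information remains unknown. Grouping similar proteins together would give biologists meaningful information to help study proteins that have not yet been annotated. Many public protein classification databases exist on the Web, such as SCOP and Pfam. Because manually annotating and classifying proteins is time-consuming, we can use clustering to group similar protein sequences into the same cluster and predict the function of an unknown protein from the annotated proteins in that cluster, since similar sequences may imply similar functions. We also consider classification methods, characterizing each protein sequence with a set of features or parameters for classification. In this thesis, we first use the BLAST or Smith-Waterman alignment tools to compute similarities between protein sequences, and use the shortest-path concept from graph theory to optimize those similarities. To integrate properties of protein function into the clustering algorithm, we use protein motifs as features and compute functional similarity with the vector model from information retrieval. We then merge the two similarity graphs and cluster the proteins with an agglomerative hierarchical clustering algorithm. Separately, we select three sets of features and classify the proteins with an SVM. The experimental results show that our methods are not satisfactory, but some information can still be observed from the results.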

Parallel Abstract


With advances in biotechnology, the number of sequenced proteins is increasing rapidly. Only a few of these proteins have been manually annotated with functions and structures; for most sequenced proteins, curated information is still missing. Categorizing similar proteins can give biologists useful information for investigating novel proteins, based on the well-curated proteins that fall in the same cluster. Several public protein classification databases are available on the Web, such as SCOP and Pfam. Manually constructing classification information for proteins is a tedious and time-consuming task. We therefore apply a clustering method to group similar protein sequences into the same cluster. For comparison with clustering, we also apply a classification method that characterizes each protein sequence by a set of features or parameters. BLAST and Smith-Waterman tools are used to calculate pairwise similarities between protein sequences, so that a set of proteins forms the nodes of a graph with the similarities as edge weights. The shortest-path method of graph theory is then applied to optimize the similarities between sequences. Protein function is also considered, to improve the quality of the graph: protein motifs are treated as features, and protein similarities are constructed with the vector model of information retrieval. After the two graphs are combined, the hierarchical agglomerative clustering (HAC) algorithm is used to cluster the protein sequences. We also select three sets of features to classify proteins with an SVM. Experiments show that these methods do not perform well enough. Nevertheless, the results point to directions for future work, and the thesis provided useful research experience.
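The clustering pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The abstract does not say how similarities become distances, how the shortest-path step is applied, how the two graphs are merged, or which linkage HAC uses, so the sketch assumes `1 - similarity` as the distance, Floyd-Warshall for all-pairs shortest paths, a weighted sum with a hypothetical parameter `alpha` for merging, and average linkage. All function names and the toy data are illustrative; the pairwise similarities stand in for normalized BLAST or Smith-Waterman scores.

```python
import math

def similarity_to_distance(sim):
    """Turn a similarity matrix with values in [0, 1] into distances (assumed: 1 - s)."""
    n = len(sim)
    return [[0.0 if i == j else 1.0 - sim[i][j] for j in range(n)] for i in range(n)]

def shortest_path_refine(dist):
    """Floyd-Warshall: replace each distance by the shortest path length in the graph."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def cosine_similarity(u, v):
    """Vector-model similarity between two motif feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def combine(seq_dist, motif_vectors, alpha=0.5):
    """Merge sequence and motif distances by a weighted sum (alpha is an assumption)."""
    n = len(seq_dist)
    return [[0.0 if i == j else
             alpha * seq_dist[i][j]
             + (1.0 - alpha) * (1.0 - cosine_similarity(motif_vectors[i], motif_vectors[j]))
             for j in range(n)] for i in range(n)]

def hac_average_linkage(dist, num_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                link = (sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                        / (len(clusters[a]) * len(clusters[b])))
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Toy data: four proteins, made-up pairwise similarities and motif counts.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
motifs = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]

seq_dist = shortest_path_refine(similarity_to_distance(sim))
clusters = hac_average_linkage(combine(seq_dist, motifs), num_clusters=2)
print(clusters)
```

On this toy input, proteins 0 and 1 end up in one cluster and proteins 2 and 3 in the other. Stopping at a fixed `num_clusters` is a simplification: a hierarchical classification would keep the full merge order as a dendrogram.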

