
Identify Hub Objects to Improve Hierarchical Clustering of Protein Families

Advisor: 林宣華

Abstract


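With the rapid development of biological sequencing technology in recent years, the number of protein sequences published in public databases has grown quickly, yet little functional annotation accompanies them, so functional analysis of protein sequences must rely on computational processing to keep pace. Members of the same protein family usually share the same or similar biochemical functions, and their sequences are considerably similar. Using a clustering algorithm to place proteins with similar sequences in the same cluster may reduce time-consuming manual analysis and help biologists quickly infer the function and structure of unknown protein sequences. In this thesis we cluster protein sequences with hierarchical agglomerative clustering (HAC); the result (F1 = 53.3%) is worse than those in the reference literature. After examining where the problem lies, we propose a new HAC merging method, LOM (Loss of Merging), which achieves comparable clustering results (F1 = 74.7%). We further observe that remote homology and the transitivity of sequence similarity keep clustering performance from improving. As clusters grow, a large cluster tends to be dominated by its most highly similar sequences and can no longer merge with other large clusters. To address this problem, drawing on the Small World theory, we identify hub objects in the clusters and merge the clusters further. Validated on SCOP classification data, F1 rises from 74.7% to 88.7%, better than other current methods.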

Parallel Abstract


With the advent of new sequencing technologies, public biological databases now store large numbers of protein sequences that lack curated information on function and structure. Computer-aided functional analysis of protein sequences has therefore become important. Proteins curated in the same family or superfamily are usually regarded as similar in function or structure, and these similarities frequently derive from similar sequences. Applying clustering algorithms to group proteins by sequence similarity is thus a possible approach to predicting the functions and structures of unknown proteins. In this thesis, we apply the Hierarchical Agglomerative Clustering (HAC) algorithm to group proteins into clusters based on sequence similarity. The result (F1 = 53.3%) is worse than those reported in several studies. After examining the problems of the clustering method, we propose a novel measure, LOM (Loss of Merging), that improves the result of HAC to F1 = 74.7%, which is comparable with the current best result (F1 = 74.9%) of Pro-Clust. Furthermore, we found that remote homology and the transitivity of similarity prevent further improvement in clustering performance. Large clusters are usually dominated by their most strongly similar sequences, and the problem gets worse as a cluster grows. As a result, large clusters cannot be merged with other large clusters. Based on ideas from the Small World theory, we identify hub objects for large clusters and find joinable clusters to merge. Using the same SCOP database as the benchmark, F1 improves from 74.7% to 88.7%, which is better than all other methods tested on SCOP.
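
To make the clustering setup concrete, the following is a minimal sketch of generic average-linkage HAC over a pairwise similarity matrix, scored with a pairwise F1. It is not the thesis's LOM measure or its hub-object step, neither of which is specified in the abstract. The function names, the toy similarity values, the stopping threshold, and the pairwise definition of F1 are all assumptions made for illustration.

```python
from itertools import combinations


def average_linkage_hac(sim, threshold):
    """Greedy HAC: repeatedly merge the two clusters with the highest
    average pairwise similarity, stopping when that average < threshold.
    (Generic average linkage; an assumption, not the thesis's LOM.)"""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best_score, best_pair = None, None
        for a, b in combinations(range(len(clusters)), 2):
            total = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
            score = total / (len(clusters[a]) * len(clusters[b]))
            if best_score is None or score > best_score:
                best_score, best_pair = score, (a, b)
        if best_score < threshold:
            break
        a, b = best_pair  # a < b, so deleting b leaves index a valid
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def pairwise_f1(clusters, labels):
    """F1 over item pairs: a pair is positive if both items share a cluster.
    (One common clustering F1; the thesis may define F1 differently.)"""
    predicted = {frozenset(p) for c in clusters for p in combinations(c, 2)}
    actual = {
        frozenset((i, j))
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }
    true_pos = len(predicted & actual)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(actual)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: sequences 0-2 form one family, 3-4 another.
# Sequence 2 resembles 1 but not 0, mimicking a remote homolog that is
# linked to its family only through an intermediate sequence.
SIM = [
    [1.00, 0.90, 0.20, 0.05, 0.05],
    [0.90, 1.00, 0.80, 0.05, 0.05],
    [0.20, 0.80, 1.00, 0.05, 0.05],
    [0.05, 0.05, 0.05, 1.00, 0.90],
    [0.05, 0.05, 0.05, 0.90, 1.00],
]
LABELS = ["A", "A", "A", "B", "B"]

if __name__ == "__main__":
    for threshold in (0.6, 0.4):
        found = average_linkage_hac(SIM, threshold)
        print(threshold, found, round(pairwise_f1(found, LABELS), 3))
```

In this toy case, once sequences 0 and 1 have merged, the weak 0-2 similarity drags the average linkage to sequence 2 below the stricter threshold, so the remote homolog stays outside its family and recall drops. This is one simple way to see why averaging over all members can block merges that a single well-connected member would support.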

