A Hybrid Protein Sequence Clustering Algorithm Based on Statistical Models

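With the rapid growth of protein sequence databases, efficient tools for protein sequence analysis are needed. Sequence alignment, the most commonly used method of protein sequence analysis, cannot effectively detect remote homology between proteins, and there is a twilight zone between sequence similarity and protein homology that is hard to interpret. Protein sequence clustering can use sequence similarity together with the characteristics of protein families to identify sets of homologous proteins. We propose a hierarchical protein sequence clustering algorithm based on statistical models. It improves on the single-linkage clustering algorithm while retaining its hierarchical nature, which matches the characteristics of protein families. First, pair clusters are created so that a single protein can appear in multiple hierarchical paths. Next, skewness and kurtosis, two commonly used statistical measures, are used to find highly homologous protein clusters. Finally, pivots (representative points) are used to build the later stages of the cluster hierarchy, which avoids the chaining effect and identifies remotely homologous protein clusters. The algorithm was validated against the SwissProt and InterPro databases, using human proteins as the experimental set. It effectively builds highly homologous protein clusters, the final clustering agrees with the hierarchy of protein families in InterPro, and it avoids the chaining effect of the single-linkage clustering algorithm. The clustering results also reveal relationships between proteins without known annotations and proteins with known annotations. With the cluster viewing tool we developed, the results can be examined from three perspectives: protein, cluster, and family.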
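The chaining effect of single-linkage clustering mentioned above can be illustrated with a minimal sketch. This is not the thesis implementation: the similarity scores, the threshold of 0.8, and the names `single_linkage`, `SCORES`, and `toy_similarity` are all assumptions made up for the example. Single-linkage clustering at one cut level is equivalent to taking the connected components of the graph whose edges join pairs at or above the threshold.

```python
from itertools import combinations


def single_linkage(items, similarity, threshold):
    """Return the single-linkage clusters at one cut level.

    Two items end up in the same cluster if they are connected by a chain
    of pairs whose similarity is >= threshold (connected components).
    """
    parent = {x: x for x in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(items, 2):
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise similarity scores (not real alignment scores).
SCORES = {
    ("A", "B"): 0.90, ("B", "C"): 0.90, ("C", "D"): 0.90,
    ("A", "C"): 0.30, ("B", "D"): 0.30, ("A", "D"): 0.05,
}


def toy_similarity(a, b):
    # Symmetric lookup; unlisted pairs are treated as unrelated (0.0).
    return SCORES.get((a, b), SCORES.get((b, a), 0.0))


clusters = single_linkage(["A", "B", "C", "D", "E"], toy_similarity, 0.8)
print(clusters)
```

In this toy data A and D are nearly unrelated (0.05), yet they land in the same cluster because the chain A-B, B-C, C-D links them. That is the behaviour the proposed algorithm aims to avoid.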

Keywords

statistical model; hierarchical; clustering; protein sequence

Parallel Abstract

Protein sequence clustering groups homologous proteins together based on pairwise sequence similarities. The conventional single-linkage clustering algorithm has been widely used for this problem because it exploits the transitivity property to identify remote homologues and provides a dendrogram as the clustering result, which is useful for protein family analysis. However, because of the twilight zone embedded in the distribution of pairwise similarities, the single-linkage algorithm sometimes generates clusters with low sensitivity for large families or for families with noisy relationships to members of other protein families. In this thesis, a hybrid hierarchical clustering algorithm is proposed to improve the quality of the dendrogram generated by the single-linkage clustering algorithm. By creating pair clusters, a single protein can exist in distinct hierarchical paths of a dendrogram. Next, the proposed algorithm employs the skewness and kurtosis indices to control the formation of subclusters, in order to generate highly homologous clusters at the bottom level of the dendrogram. Finally, selecting pivots of each subcluster for the subsequent clustering process avoids the chaining effect that the single-linkage algorithm might cause. Thus the proposed algorithm can produce clusters with both high sensitivity and high specificity at the higher levels of the dendrogram. The experimental results in this thesis show that the hierarchy produced by the proposed algorithm matches the hierarchy of protein families better than the hierarchy generated by the single-linkage algorithm. In this regard, the generated hierarchy can provide automatic annotations for new proteins with higher accuracy than previous approaches.
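To make the role of the skewness and kurtosis indices more concrete, the following sketch computes both from a list of pairwise similarity scores inside a candidate subcluster and uses them as a simple acceptance test. The abstract does not state the actual criteria, so the function names `skewness_and_kurtosis` and `looks_homogeneous`, the thresholds, and the sample scores are hypothetical and serve only to illustrate the idea.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:
        # Convention for this sketch only: constant data counts as
        # perfectly symmetric with no heavy tails.
        return 0.0, 0.0
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def looks_homogeneous(scores, max_abs_skew=1.0, max_excess_kurtosis=1.0):
    """Hypothetical gate: accept a subcluster whose score distribution is
    roughly symmetric and not heavy-tailed. Thresholds are assumptions."""
    skew, kurt = skewness_and_kurtosis(scores)
    return abs(skew) <= max_abs_skew and kurt <= max_excess_kurtosis


# Made-up pairwise similarity scores within two candidate subclusters.
tight = [0.90, 0.92, 0.88, 0.91, 0.89, 0.90]  # all members alike
mixed = [0.90, 0.91, 0.90, 0.92, 0.90, 0.10]  # one weakly related pair

print(looks_homogeneous(tight), looks_homogeneous(mixed))
```

A single weak pair drags the score distribution into a long left tail, so the skewness of `mixed` is strongly negative and the gate rejects it, while `tight` is accepted.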

Parallel Keywords

statistical models; hierarchy; clustering; protein sequence
