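This study investigates how the physical and chemical environment parameters of protein residues, such as residue secondary structure and solvent-accessible surface area, can be used to predict the fold class of a protein's three-dimensional structure. Environment score matrices are computed for the protein molecules in the SCOP database, and the corresponding structural profiles are built. Secondary-structure data are selected from the DSSP database and grouped by SCOP class to form the input classification. A program developed for this work aligns protein sequences against protein structures described by their residue environment parameters. Residues are sampled by solvent-accessible surface area (buried (B), partially buried (P), and exposed (E)) and by secondary structure (α-helix, β-sheet, and coil), which gives the nine environment classes used in this study: each of B, P, and E combined with each of the three secondary-structure states. In computing the structural profile for each SCOP class, the following simplifications are applied: (1) only monomeric proteins are used, (2) proteins with disulfide bonds are excluded, and (3) only sequences with less than 25% similarity are kept. The 25% cutoff filters out highly similar sequences so that their contributions are not counted repeatedly in the score matrix. The profiles are then used to predict the fold class of protein sequences; with this method the prediction accuracy reaches above 95% (when the average score is < 0.5). Finally, the feasibility of applying the method to three-dimensional structure prediction for protein sequences with less than 25% similarity is examined.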
In this thesis, I investigated how information on the physicochemical environment of amino acids, such as protein secondary structure and residue solvent accessibility, can improve the prediction of protein classes. Score matrices were computed for several classes (such as all-α and all-β, following the SCOP classification) of known protein sequences. Sequences were taken from a protein secondary structure database, for example DSSP. From these data, one can construct 3D structure profiles for each entry in the PDB database. These profiles are used to score a query protein sequence for compatibility with the known classes. To demonstrate that the 3D structure profile method can detect sequences compatible with a known class, the query sequences are aligned with the environment of a known protein structure using a simple sequence alignment algorithm. My study indicated that the method achieves greater than 95% accuracy in protein class assignment (average score < 0.5). Furthermore, I found that the structure profile approach is able to detect distant sequences well below the twilight zone (less than 25% sequence similarity).
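The environment classes and score matrix described above can be illustrated with a minimal sketch. It is not the program developed in the thesis: the burial cutoffs (`BURIED_MAX`, `PARTIAL_MAX`), the reduction of DSSP codes to three states, the log-odds form of the score, the pseudocount, and all function names are assumptions introduced here for illustration, and the sketch scores an ungapped, position-by-position pairing rather than a full alignment.

```python
import math
from collections import Counter

# Hypothetical relative-accessibility cutoffs; the abstract states no thresholds.
BURIED_MAX = 0.15
PARTIAL_MAX = 0.40

# Assumed reduction of DSSP codes to three states (a = helix, b = sheet, c = coil).
HELIX_CODES = {"H", "G", "I"}
SHEET_CODES = {"E", "B"}


def burial_state(relative_accessibility):
    """Return B (buried), P (partially buried), or E (exposed)."""
    if relative_accessibility < BURIED_MAX:
        return "B"
    if relative_accessibility < PARTIAL_MAX:
        return "P"
    return "E"


def secondary_state(dssp_code):
    """Map a DSSP code to a (helix), b (sheet), or c (coil)."""
    if dssp_code in HELIX_CODES:
        return "a"
    if dssp_code in SHEET_CODES:
        return "b"
    return "c"


def environment_class(relative_accessibility, dssp_code):
    """Combine burial and secondary structure into one of nine classes."""
    return burial_state(relative_accessibility) + secondary_state(dssp_code)


def build_score_matrix(observations, pseudocount=1.0):
    """Log-odds score ln(P(aa | env) / P(aa)) from (aa, env) observations.

    The log-odds form and the pseudocount are assumptions of this sketch.
    """
    pair_counts = Counter(observations)
    env_counts = Counter(env for _, env in observations)
    aa_counts = Counter(aa for aa, _ in observations)
    total = len(observations)
    amino_acids = sorted(aa_counts)
    matrix = {}
    for env, env_total in env_counts.items():
        denominator = env_total + pseudocount * len(amino_acids)
        for aa in amino_acids:
            p_aa_given_env = (pair_counts[(aa, env)] + pseudocount) / denominator
            p_aa = aa_counts[aa] / total
            matrix[(aa, env)] = math.log(p_aa_given_env / p_aa)
    return matrix


def average_score(sequence, environments, matrix):
    """Mean per-residue score of a sequence placed on a list of environments."""
    if len(sequence) != len(environments):
        raise ValueError("sequence and environment list must have equal length")
    scores = [matrix.get(pair, 0.0) for pair in zip(sequence, environments)]
    return sum(scores) / len(scores)
```

In this sketch, a residue type that occurs more often in an environment class than in the data set overall receives a positive score for that class. How this relates to the "average score < 0.5" criterion quoted above depends on the thesis's own score definition, which the abstract does not give.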