
On the Enhancement of the Efficiency of Protein Secondary Structure Prediction

Advisor: 羅惟正
The full text will be available for download on 2024/10/26.

Abstract


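Protein secondary structure prediction methods are widely used in bioinformatics. A step shared by the commonly used methods is searching a very large target dataset of amino acid sequences to build a position-specific scoring matrix (PSSM), which serves as the feature set for machine learning; the most common target dataset is the latest release of UniRef. In this post-genomic era, however, the number of sequenced proteins is growing exponentially while secondary structure prediction accuracy has improved only marginally. Do we really need such a large dataset? In this work we propose two hypotheses: first, the size of the target dataset affects prediction speed far more than it affects accuracy; second, the degree of similarity among the sequences in the target dataset affects prediction accuracy. To test the first hypothesis, we sampled the original target dataset (UniRef90 2015) down to different sizes. The results support the hypothesis: when the target dataset is reduced to 1/10 of its original size, prediction is nearly 10 times faster while accuracy drops by less than 1%. We also derived a set of formulas that model the prediction time required and the accuracy achieved at different target dataset sizes. To test the second hypothesis, we reduced the sequence similarity within the original target dataset and sampled the results to the same size. We found that sequence similarity does affect accuracy and, moreover, that lowering similarity actually improves accuracy. To explore why accuracy improves, we identified an indicator of the quality of the PSSM built from a target dataset: the amount of information contained in the PSSM, that is, its entropy, is strongly positively correlated with secondary structure prediction accuracy when only a single variable is changed. Finally, we propose a simple strategy: reduce the similarity among amino acid sequences in the target dataset to 25%, then sample down to 3.82 million sequences, which gives the best efficiency. Simply swapping in the new target dataset can greatly speed up currently running protein secondary structure prediction systems while maintaining the same level of accuracy.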
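The resampling step described in the abstract can be illustrated with a minimal sketch. It assumes uniform random sampling without replacement over FASTA records; the abstract does not state the sampling procedure actually used, and the names `parse_fasta` and `subsample` are hypothetical helpers introduced only for this illustration.

```python
import random


def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample(records, fraction, seed=0):
    """Return a uniform random subset holding about `fraction` of the records.

    Assumption: plain random sampling without replacement; the thesis may
    have used a different scheme.
    """
    records = list(records)
    k = min(len(records), max(1, round(len(records) * fraction)))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(records, k)
```

For example, `subsample(parse_fasta(open_file), 0.1)` would keep roughly one tenth of the records, matching the 1/10 reduction mentioned in the abstract. For a dataset too large to hold in memory, a streaming method such as reservoir sampling would be a more practical choice.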

Parallel Abstract


Secondary structure prediction (SSP) methods are widely used in bioinformatics. An essential step of these methods is generating a position-specific scoring matrix (PSSM) by searching a huge target dataset; the PSSM scores are then used as features for machine learning. SSP methods usually employ the latest non-redundant protein dataset. However, while the number of known amino acid sequences grows exponentially in this post-genomic era, SSP accuracy has increased only slightly. Do we really need such a huge target dataset? Since SSP accuracy has almost reached its upper limit, we focus on increasing prediction speed while maintaining accuracy. We hypothesize that 1) the size of the target dataset influences SSP running time more than SSP accuracy, and 2) the homology level of the target dataset influences SSP accuracy. To test the first hypothesis, we resample the original UniRef90 target dataset from 2015 to different sizes and measure SSP performance. The results show that when the target dataset is reduced to 1/10 of its size, prediction becomes nearly 10 times faster while SSP accuracy decreases by less than 1%. To test the second hypothesis, we reduce the homology level of the original target dataset and resample it to a fixed size. We find that the homology level of the target dataset does affect SSP accuracy and, surprisingly, that a lower homology level improves it. To explore why accuracy improves at a lower homology level, we examine the information entropy of the PSSM and find that it is a possible indicator of PSSM quality: PSSM entropy is highly correlated with SSP accuracy. In summary, we propose a simple strategy that significantly increases the speed of secondary structure prediction without sacrificing accuracy: first, reduce the homology level of the original target dataset to 25%; then, resample the homology-reduced dataset to 3.82 million sequences. This strategy can easily be incorporated into current secondary structure prediction systems to improve their overall efficiency.
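The entropy indicator mentioned in the abstract can be made concrete with a small sketch. It assumes that each PSSM position is available as a row of non-negative amino acid frequencies and uses the mean per-position Shannon entropy in bits; the exact entropy definition used in the thesis is not given in the abstract, and the function names `column_entropy` and `mean_entropy` are hypothetical.

```python
import math


def column_entropy(freqs):
    """Shannon entropy (bits) of one position's amino acid frequencies.

    Assumption: `freqs` holds non-negative frequencies or counts, not the
    log-odds scores of a raw PSSM; rows are normalized before use.
    """
    total = sum(freqs)
    if total <= 0:
        return 0.0
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)


def mean_entropy(profile):
    """Average per-position entropy over a profile (a list of frequency rows)."""
    if not profile:
        return 0.0
    return sum(column_entropy(row) for row in profile) / len(profile)
```

A fully conserved position scores 0 bits, and a position with all 20 amino acids equally frequent scores log2(20), about 4.32 bits, so under this definition the value summarizes how much variation the profile captures at each position.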

