Protein Crystallization Prediction Using Hybrid Feature Selection

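Proteins are the main building blocks of life. A protein's function depends on its structure, so determining the three-dimensional structure of protein molecules is a long-standing goal for scientists. Apart from predicting structure with statistical learning theory, current practice usually determines it from X-ray diffraction or Nuclear Magnetic Resonance (NMR) experiments. NMR is time-consuming and costly, and it does not always succeed in resolving the structure. If a protein solution can be made to yield crystals, however, the crystals can be analyzed with X-ray diffraction. Not every protein can form crystals, so predicting whether a protein can be crystallized becomes an important problem.

We encode the information provided by the amino acid sequences (the primary structure) of proteins obtained from the TargetDB protein database, and use two feature selection methods, F-score and Information Gain, to pick the features that contribute most to predicting protein crystallization. We then train on the selected data with a support vector machine (SVM) and with the Adaboost algorithm separately. The SVM uses a hyperplane to separate data of different classes in the feature space and thereby classify them. Adaboost repeatedly adjusts the weight of each training example over a number of learning rounds to reduce the error rate of a weak learner, and finally combines the results of these rounds into a strong learner that performs the classification.

In our experiments, the prediction accuracy on the TargetDB data reaches 93.02%, with a sensitivity (crystallizable data correctly classified as crystallizable) of 95.49% and a specificity (non-crystallizable data correctly classified as non-crystallizable) of 86.08%. The purpose of these experiments is to identify the factors that prevent proteins from crystallizing and then to improve on those factors, so that crystals of these proteins can be obtained and their structural information can be determined by X-ray diffraction.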
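The reweighting loop that the abstract attributes to Adaboost can be sketched as follows. This is a minimal illustration and not the thesis implementation: the abstract does not name the weak learner, so one-feature threshold rules (decision stumps) are assumed here, labels are assumed to be +1 (crystallizable) and -1 (non-crystallizable), and all function names are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump.

    A stump predicts `polarity` when row[feature] >= threshold, else -polarity.
    Candidate thresholds are the values observed in the training data.
    """
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[j] >= thr else -pol for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best

def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] >= thr else -pol

def adaboost(X, y, rounds=10):
    """Train `rounds` stumps, reweighting the training examples after each one."""
    n = len(X)
    w = [1.0 / n] * n                              # start from uniform weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)     # vote of this weak learner
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Strong learner: sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data (not from TargetDB): no single threshold separates the classes.
X_toy = [[1], [2], [3], [4], [5], [6]]
y_toy = [1, 1, -1, -1, 1, 1]
model = adaboost(X_toy, y_toy, rounds=3)
print([predict(model, row) for row in X_toy])
```

On the toy data no single stump is correct on every example, yet the weighted vote of three stumps is, which is the sense in which weak learners are combined into a strong learner.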

Keywords

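Support Vector Machine; Adaptive Boosting Algorithm (Adaboost); Feature Selection; Machine Learning; Imbalanced Data Set; Protein Crystallization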

Parallel Abstract

Proteins are the major components of organisms. The structure of a protein gives information about its functions, so it is important to determine protein structures. Scientists usually use X-ray diffraction or Nuclear Magnetic Resonance (NMR) to do so. However, NMR is time-consuming and expensive, so X-ray diffraction is the more common choice. X-ray diffraction requires that the target protein can be crystallized. Predicting whether a target protein can be crystallized is therefore very important. In this thesis, we use the data in TargetDB to generate a data set whose features are significantly related to protein crystallization. We then apply two feature selection methods to the data set to remove irrelevant or redundant features. After feature selection, we use the support vector machine (SVM) and Adaboost separately to predict whether the proteins can be crystallized, and we compare and discuss the results of the two methods. According to our experimental results, Adaboost achieves higher accuracy than SVM on the same data set. The prediction accuracy of Adaboost is 93.02%, and its sensitivity (on crystallized data) and specificity (on non-crystallized data) are 95.49% and 86.08%, respectively. The purpose of our experiments is to identify the factors that may prevent proteins from crystallizing, so that scientists can improve protein crystallization and then apply X-ray diffraction to determine the structures of these proteins.
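As an illustration of the filter-style feature selection mentioned above, the sketch below ranks features by F-score. The abstract does not give the formula, so the commonly used two-class definition is assumed: the squared distances of the per-class means from the overall mean, divided by the sum of the per-class sample variances. The function names and toy values are hypothetical and are not taken from TargetDB.

```python
def f_score(pos, neg):
    """F-score of one feature from its values in the positive and negative class.

    Assumes each class has at least two values (sample variance uses n - 1).
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    # Numerator: how far each class mean lies from the overall mean.
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    # Denominator: sample variance within each class.
    within = (sum((v - mean_pos) ** 2 for v in pos) / (n_pos - 1)
              + sum((v - mean_neg) ** 2 for v in neg) / (n_neg - 1))
    if within == 0:
        return 0.0 if between == 0 else float("inf")
    return between / within

def rank_features(X, y):
    """Return feature indices sorted from most to least discriminative."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label != 1]
        scores.append((f_score(pos, neg), j))
    return [j for _, j in sorted(scores, reverse=True)]

# Toy data: feature 0 separates the classes, feature 1 does not.
X_toy = [[1, 5], [2, 1], [3, 4], [4, 2], [5, 5], [6, 3]]
y_toy = [1, 1, 1, -1, -1, -1]
print(rank_features(X_toy, y_toy))
```

A larger F-score suggests a feature whose class means are far apart relative to its spread within each class. A ranking like this scores each feature independently, so it cannot detect redundancy between features.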

Parallel Keywords

Support Vector Machine; Adaboost; Feature Selection; Machine Learning; Imbalanced Data; Protein Crystallization

