
Protein Complexes Classification Study Based on Machine Learning Approaches (基於機器學習方法之蛋白質複合體分類研究)

Advisor: 黃建宏
Co-advisor: 吳家樂 (Ka-Lok Ng)

Abstract


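Although proteins are the end products of cellular processes, a single protein cannot function on its own: it must interact with other proteins and form a protein complex to carry out its function. Developing effective methods for predicting protein complexes is therefore of considerable importance. Many prediction methods already exist, for example (1) using graph theory to study dense regions of protein interaction networks; (2) relying on experimental data such as mass spectrometry; (3) the core-attachment approach; and (4) heterogeneous data integration. These methods share a limitation, however: they consider only the static properties of a protein complex and leave out its biochemical properties, which degrades prediction performance and leaves room for improvement.

In this study, in addition to the topological properties of the interactions among a complex's subunits, we also consider the physicochemical properties of each subunit in order to describe the complex as a whole. The method has three main steps: (1) parameter calculation, (2) selection of important parameters, and (3) validation of classification accuracy. In the parameter calculation step we consider 27 parameters, and we use two different statistical methods, principal component analysis (PCA) and logistic regression (LR), to judge the importance of each parameter. The important parameters extracted in these first two steps are then used in the validation step to construct feature vectors. Two machine learning methods, support vector machines (SVM) and neural networks (NN), are trained on these vectors, and 6-fold cross-validation is used to measure the classification accuracy of each feature combination.

The experimental results show that, among two-parameter predictions, GO annotation combined with sequence similarity gives the best accuracy, and that adding the isoelectric point (pI) to the classification slightly improves prediction performance. This also confirms that physicochemical properties are a useful reference in protein complex prediction.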

Parallel Abstract


Protein complexes play important roles in many cellular processes. Several approaches have been developed for protein complex prediction, such as (1) using graph theory to study dense protein-protein interaction regions, (2) relying on experimental data such as tandem mass spectrometry, (3) the core-attachment approach, and (4) heterogeneous data integration. All of these approaches are limited in that they consider only the static, non-biochemical properties of a protein complex. In this thesis, we propose integrating several aspects of a protein complex, i.e. its static as well as its physicochemical properties, in order to describe it. Our method consists of three main steps: (i) estimation of parameter values, (ii) selection of major parameters, and (iii) validation of classification accuracy. In the parameter estimation step, 27 parameters are considered. Principal component analysis (PCA) and logistic regression (LR) are used to determine the major features. In the validation step, the major features extracted in the previous step are used to construct feature vectors, which are then used to train two machine learning methods, support vector machines (SVM) and neural networks (NN). A 6-fold cross-validation test is performed to investigate the classification accuracy of all major feature subsets. The results indicate that combining the isoelectric point (pI) with the GO annotation and sequence similarity features achieves slightly better classification accuracy. By taking physicochemical properties into consideration, the present study could help improve the accuracy of protein complex prediction tools.
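The validation step described in the abstract, scoring each feature subset by 6-fold cross-validation, can be sketched roughly as follows. This is only an illustration under several assumptions: the data are synthetic, the three feature names are hypothetical labels for the features the abstract mentions, and a simple nearest-centroid classifier stands in for the SVM and NN models used in the thesis so that the sketch needs nothing beyond the Python standard library.

```python
import random
from itertools import combinations

def k_fold_indices(n, k, seed=0):
    """Shuffle sample indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_predict(train_X, train_y, x):
    """Stand-in classifier (NOT the thesis's SVM/NN): pick the closest class mean."""
    sums, counts = {}, {}
    for row, label in zip(train_X, train_y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    best_label, best_dist = None, float("inf")
    for label, acc in sums.items():
        centroid = [v / counts[label] for v in acc]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cv_accuracy(X, y, features, k=6, seed=0):
    """Fraction of samples classified correctly when each fold is held out once."""
    correct = 0
    for fold in k_fold_indices(len(X), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [[X[i][f] for f in features] for i in train]
        train_y = [y[i] for i in train]
        for i in fold:
            x = [X[i][f] for f in features]
            if nearest_centroid_predict(train_X, train_y, x) == y[i]:
                correct += 1
    return correct / len(X)

# Hypothetical feature names; the values below are synthetic, not thesis data.
FEATURES = ["go_similarity", "sequence_similarity", "isoelectric_point"]
rng = random.Random(42)
y = [i % 2 for i in range(60)]  # two arbitrary classes, labelled 1 and 0
X = [[rng.gauss(0.7 if label else 0.4, 0.15),
      rng.gauss(0.6 if label else 0.45, 0.15),
      rng.gauss(6.5 if label else 7.0, 1.0)] for label in y]

# Score every non-empty feature subset with 6-fold cross-validation.
for size in range(1, len(FEATURES) + 1):
    for subset in combinations(range(len(FEATURES)), size):
        names = ", ".join(FEATURES[f] for f in subset)
        print(f"{names}: {cv_accuracy(X, y, list(subset)):.3f}")
```

The printed accuracies say nothing about real protein complexes; they only show the shape of the comparison across feature subsets. A closer reproduction would swap the stand-in classifier for SVM and NN implementations from a machine learning library and would scale the features first, since the pI values span a much wider range than the similarity scores.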

Keywords

Protein Complexes; Machine Learning

