Developing a Machine Learning-Based Scoring Function to Predict Protein-DNA Binding Affinity

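Proteins are essential to sustaining life. In living organisms, the binding of proteins to DNA drives many biochemical reactions and activities; for example, the binding of a transcription factor to a specific DNA sequence can switch on transcription of a particular gene. For this reason, protein-DNA interactions have long been an intensely studied subject among biologists. In recent years, as computer technology and computing power have advanced, biologists and statisticians have increasingly used the computational and data-integration capabilities of software to support traditional laboratory research. Within this work, predicting the binding affinity between proteins and other biological units, such as other proteins, small molecules, and even DNA, has remained a topic of great interest. Many recent studies have addressed it and produced a variety of scoring functions for affinity prediction; among them, machine learning-based scoring functions have in recent years achieved good results in predicting protein-small molecule binding affinity. This thesis attempts to design a machine learning-based scoring function that predicts protein-DNA binding affinity. High-quality protein-DNA complex structures and experimentally determined affinity data were selected as the source material, and a scoring function was built on a knowledge base combined with machine learning algorithms. Experimental results show that a random forest-based classification approach can also yield good predictions for protein-DNA binding affinity. The thesis additionally introduces several kinds of feature extraction and discusses their effect on the prediction results, in the hope of contributing to research on scoring functions for binding affinity between biological macromolecules.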

Keywords

protein-DNA interaction; scoring function; random forest; binding affinity prediction

Parallel Abstract

Proteins and DNA play important roles in maintaining life in living cells. The binding of proteins to specific DNA sequences is the starting point of many biological activities. For instance, transcription factors, which are proteins that trigger the transcription of a particular gene, initiate the transcription process by binding to regulatory sites on DNA. Research on this issue could facilitate studies of gene regulation and regulatory networks. For these reasons, interactions between proteins and DNA have long attracted much attention. With recent advances in computer technology and algorithm development, computational methods for predicting the binding affinity of protein-protein, protein-ligand, and even protein-DNA interactions have been widely studied. Some scoring functions for predicting protein-ligand binding affinity have been shown to perform well on this challenge. In this thesis, a machine learning-based scoring function was developed to predict the binding affinity of protein-DNA interactions. For this purpose, a high-quality dataset of protein-DNA complexes with associated binding affinity information was collected from PDBbind. The performance of the proposed method was compared with that of existing scoring functions, and it is concluded that the proposed machine learning-based scoring function performs well in predicting the binding affinities of protein-DNA complexes and can benefit future studies on this problem.

Parallel Keywords

protein-DNA interaction; scoring function; random forest; binding affinity prediction

