  • Thesis

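Using machine learning algorithms to select appropriate template structures to improve the accuracy of predicting transcription factor binding sequence profiles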

Selecting appropriate template structures to improve precision in predicting protein-DNA binding profiles

Advisor: 歐陽彥正
Co-advisor: 陳倩瑜

Abstract


DNA-binding proteins such as transcription factors use DNA-binding domains (DBDs) to bind specific sequences in the genome and thereby initiate many important biological functions. Accurate prediction of these target sequences, often represented as position weight matrices (PWMs), is an important step toward understanding many biological processes. Recent studies have shown that knowledge-based potential functions can be applied to protein-DNA co-crystallized structures to generate PWMs that are largely consistent with experimental data. However, this success has not been extended to DNA-binding proteins that lack co-crystallized structures. This study investigates whether the DNA sequences bound by a DNA-binding protein can be predicted from the protein's unbound structure (its structure in the unbound state). Given an unbound structure of the query protein, the proposed method first aligns this structure to all template structures to generate synthetic protein-DNA complexes. It then builds a classifier using support vector machines (SVMs) to select the most appropriate complex for PWM prediction. The feature set used by the prediction model includes the similarities between the query and template proteins, structural composition such as the percentage of alpha-helix, and the number of residues falling within specific protein-DNA distances in the synthetic complex. Once the appropriate complex is selected, an atomic-level knowledge-based potential function is used to predict PWMs characterizing the sequences to which the query protein can bind. The proposed method is evaluated on 19 DNA-binding proteins that have both DNA-bound and unbound structures for prediction as well as annotated PWMs for validation. The analyses conducted in this study show that the conformational change of a protein upon binding DNA is the factor that most influences prediction accuracy.
Moreover, to facilitate PWM prediction from protein-DNA complexes, or even from structures in the unbound state, this study presents the DBD2BS web server. DBD2BS provides an easy-to-use interface for visualizing the PWMs predicted from different templates and the spatial relationships among the query protein, the DBDs, and the DNA. This study sheds light on the challenge of predicting the target DNA sequences of proteins lacking co-crystallized structures and encourages further work on structure alignment-based approaches, in addition to docking- and homology modeling-based approaches, for generating synthetic complexes.
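
One of the classifier features named in the abstract is the number of residues falling within specific distances between the protein and the DNA in a synthetic complex. The following is a minimal sketch of how such a count might be computed. The function name `count_contact_residues`, the input layout, and the cutoffs of 4, 6, and 8 Å are illustrative assumptions and are not taken from the thesis.

```python
import math

def count_contact_residues(protein_residues, dna_atoms, cutoffs=(4.0, 6.0, 8.0)):
    """Count residues whose closest atom lies within each cutoff of any DNA atom.

    protein_residues: dict mapping a residue id to a list of (x, y, z) atom coordinates
    dna_atoms: list of (x, y, z) coordinates of DNA atoms
    cutoffs: distance thresholds in angstroms (hypothetical values, not from the source)
    """
    counts = {c: 0 for c in cutoffs}
    for atoms in protein_residues.values():
        # Minimum distance from any atom of this residue to any DNA atom;
        # infinity when either side has no atoms.
        min_dist = min(
            (math.dist(a, d) for a in atoms for d in dna_atoms),
            default=math.inf,
        )
        for c in cutoffs:
            if min_dist <= c:
                counts[c] += 1
    return counts

# Toy coordinates for illustration only (not real structural data).
residues = {
    "A10": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "A11": [(10.0, 0.0, 0.0)],
    "A12": [(20.0, 0.0, 0.0)],
}
dna = [(3.0, 0.0, 0.0)]
print(count_contact_residues(residues, dna))
```

Counts of this kind could be concatenated with the similarity and secondary-structure features to form one feature vector per synthetic complex, which is the sort of input an SVM classifier would take.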

