
電腦設計胜肽抑制劑、次世代定序軟體及建構定量構效模型

Computational Design for Peptide Inhibitors, Next-Generation Bioinformatics Tools, and Building Quantitative Structure-Activity Relationship Models

Advisor: 曾宇鳳

Abstract


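In computer-aided drug design, computers do more than help find new therapeutic compounds. Throughout drug development, from target discovery and molecular design to the prediction of compound properties, bioinformatics and machine learning techniques can assist. This dissertation discusses computer-aided applications across that process: finding therapeutic targets with next-generation sequencing, designing therapeutic peptides, and building toxicity models.

The dissertation describes three studies related to computer-aided drug design. Chapter one describes the computer-aided design of peptides derived from the trowaglerix α subunit, optimized to increase their inhibitory activity against glycoprotein VI (GPVI) and thereby inhibit collagen-induced platelet aggregation. Chapter two is a literature review introducing the de novo sequence assembly and gene set analysis tools used to analyze next-generation sequencing data. Finally, chapter three uses aqueous toxicity data for Tetrahymena pyriformis as an example to explain how to design a toxicity prediction model with an appropriate protocol.

Trowaglerix was found by Prof. Ter-Fu Huang (黃德富) and his research group to inhibit GPVI-mediated, collagen-induced platelet aggregation. They also found a decamer peptide within the sequence of the trowaglerix α subunit that has the same inhibitory function. Chapter one describes how molecular docking and molecular dynamics simulation were used to predict where this peptide may bind on GPVI. From the results we infer that the peptide probably binds at the lower surface of the GPVI D1/D2 domain, although binding at the collagen binding site cannot be ruled out. To design peptides that inhibit platelet aggregation, we developed a peptide design method based on a greedy algorithm to optimize the inhibitory activity of the current decamer peptide. In total we designed 6 decamer, 11 hexamer, and 10 dodecamer peptides with the potential to inhibit GPVI at the collagen binding site, and 12 decamer peptides with the potential to inhibit GPVI at the D1/D2 domain binding site. In experiments, one of the decamer peptides inhibited platelet aggregation about as well as the original trowaglerix decamer peptide.

Since its introduction in 2005, next-generation sequencing (NGS) has changed how research in genomics and transcriptomics is done. With the explosive development of NGS, a single run can now produce about 1.8 trillion base pairs, and the cost of sequencing a human genome has fallen to about US$1,000. NGS currently has two main applications: genome assembly and gene expression analysis. Here we discuss bioinformatics tools for de novo sequence assembly and gene set enrichment analysis. In de novo assembly, two main graph algorithms assemble short reads into longer sequences: overlap-layout-consensus and de Bruijn graphs. In gene set enrichment analysis, the same tools previously used for microarray data can still be applied, and NGS data can provide better accuracy.

In quantitative structure-activity relationship (QSAR) analysis, the inclusion and application of more and more methods for exploring chemical databases has brought lasting progress. Still, for some molecular datasets, traditional 2D or 3D QSAR methods cannot produce a good predictive model. A method suited to large, structurally diverse chemical datasets and to class imbalance (where the vast majority of compounds share the same experimental outcome) would be a major breakthrough for scientists studying biological and chemical data. In the study in chapter three, we explore, analyze, and discuss how to build a continuous model of aqueous toxicity in Tetrahymena pyriformis using support vector regression. The model uses three kinds of molecular descriptors to cover the molecules' physicochemical properties as fully as possible: (i) 2D, 2.5D, and 3D physical features, (ii) VolSurf-like molecular interaction fields, and (iii) 4D fingerprints. The best R² on the training set reached 0.924, and the R² values on the two test sets were 0.832 and 0.620, respectively. This study demonstrates the predictive ability of QSAR models built from curated training data, biologically relevant molecular descriptors, and support vector regression.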

Parallel Abstract


In computer-aided drug discovery, finding therapeutic molecules is not the only task that computers can assist with. The drug development process, from target discovery and molecule design to the prediction of compound properties, can be supported by bioinformatics and machine learning techniques. This dissertation discusses computer-aided drug discovery techniques ranging from finding therapeutic targets with next-generation sequencing, through designing therapeutic peptides, to building toxicology prediction models.

Three studies related to computer-aided drug discovery are described. The first chapter describes the computer-aided design of peptides derived from the α subunit of the snake venom trowaglerix that inhibit collagen-induced platelet aggregation by binding to glycoprotein VI (GPVI). The second chapter is a review of choosing error correction and de novo assembly tools for single-molecule sequencing. The third chapter takes a dataset of aqueous toxicity in Tetrahymena pyriformis as an example to explain how to design an adequate protocol for toxicology modeling.

Prof. Ter-Fu Huang's group discovered that the snake venom trowaglerix inhibits GPVI-mediated, collagen-induced platelet aggregation. They also found a decamer peptide, cleaved from the α subunit of trowaglerix, that likewise inhibits collagen-induced platelet aggregation. In the study described in chapter one, molecular docking and molecular dynamics simulations were performed to predict the binding position of the decamer peptide on GPVI. Two possible binding sites were suggested: one at the inner surface between the D1 and D2 domains (the D1/D2 binding site), the other at the collagen binding site. To suggest new GPVI-inhibiting peptides, a peptide design method based on a greedy algorithm was developed to optimize binding potency.
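The abstract does not spell out the greedy design procedure, so the following is only a minimal sketch of one common form of greedy sequence optimization: repeatedly accept the single-residue substitution that most improves a score, and stop when no substitution helps. Everything here is an assumption for illustration: the function names, the all-alanine seed, and especially `toy_score`, which stands in for whatever docking-based or energy-based score the thesis actually used.

```python
from typing import Callable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def greedy_optimize(seed: str, score: Callable[[str], float],
                    max_rounds: int = 50) -> str:
    """Greedy single-substitution search; a higher score is better."""
    best, best_score = seed, score(seed)
    for _ in range(max_rounds):
        candidate, candidate_score = best, best_score
        # Try every single-residue substitution of the current best peptide.
        for pos in range(len(best)):
            for aa in AMINO_ACIDS:
                if aa == best[pos]:
                    continue
                mutant = best[:pos] + aa + best[pos + 1:]
                s = score(mutant)
                if s > candidate_score:
                    candidate, candidate_score = mutant, s
        if candidate == best:  # no substitution improved the score
            break
        best, best_score = candidate, candidate_score
    return best


def toy_score(peptide: str) -> float:
    """Hypothetical placeholder score: matches to an arbitrary motif.

    A real pipeline would call a docking or binding-energy estimate here.
    """
    motif = "WKDEWKDEWK"  # arbitrary, not a sequence from the thesis
    return float(sum(a == b for a, b in zip(peptide, motif)))


# Hypothetical all-alanine decamer as the starting point.
result = greedy_optimize("AAAAAAAAAA", toy_score)
```

With this toy score the search fixes one position per round and converges on the motif itself. With a realistic score, a greedy search of this kind can stall in a local optimum, which is one reason to treat its output as candidates for experimental testing.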
In total, 6 decamer peptides, 11 hexamer peptides, and 10 dodecamer peptides were suggested as potential binders at the collagen binding site of GPVI, and 12 decamer peptides as potential binders at the D1/D2 binding site. Among these peptides, one decamer showed nearly the same activity as the original peptide in inhibiting collagen-induced platelet aggregation.

Since they were first announced in 2005, next-generation sequencing (NGS) techniques have changed the way research is done in genomics and transcriptomics. With the explosive development of NGS, platforms can currently provide up to 1,800 gigabase pairs per run and have reduced the cost to US$1,000 per human genome. NGS data are mainly applied in two ways: genome assembly and gene expression analysis of the transcriptome. This review covers bioinformatics tools used in de novo genome/transcriptome assembly and gene set enrichment analysis. In de novo assembly, short reads are assembled into longer contigs mainly with two algorithms: overlap-layout-consensus and de Bruijn graphs. In gene set enrichment analysis, most microarray tools can still be used, and NGS data offer higher accuracy.

The inclusion and accessibility of different methodologies for exploring chemical datasets have benefited the field of Quantitative Structure-Activity Relationship (QSAR) modeling. Several molecular systems that historically did not perform well with traditional and three-dimensional QSAR methodologies have been adequately explained using contemporary QSAR modeling methods and protocols. The ability of these methodologies and protocols to accommodate large datasets (several thousand compounds) that are chemically diverse, and in the case of classification modeling unbalanced (one experimental outcome dominates the dataset), allows scientists to further explore a remarkable amount of biological and chemical information.
In the study described in chapter three, the creation of a continuous Tetrahymena pyriformis (T. pyriformis) toxicity model was explored, analyzed, and discussed using Support Vector Regression (SVR) techniques. The models were constructed with three types of molecular descriptors that capture the gross physicochemical features of the compounds: (i) 2D, 2.5D, and 3D physical features, (ii) VolSurf-like molecular interaction fields, and (iii) 4D-Fingerprints. The best model had an R² value of 0.924 for the training set and predicted the continuous endpoints for two test sets with R² values of 0.832 and 0.620. These studies demonstrate the predictive ability (classification and continuous endpoints) of QSAR models constructed from curated datasets, biologically relevant molecular descriptors, and Support Vector Regression.
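To make the reported R² values easier to interpret, here is a small sketch of how such a value can be computed from observed and predicted endpoints. It assumes the coefficient-of-determination definition, R² = 1 − SS_res / SS_tot. The abstract does not say which definition was used, and some QSAR studies report the squared Pearson correlation instead. The function name and the numbers are illustrative only and are not data from the thesis.

```python
from statistics import fmean


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = fmean(observed)
    # Residual sum of squares: error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


# Made-up numbers for illustration only (not data from the thesis).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
```

For these made-up values the result is roughly 0.98. Under this definition, a drop from the training-set value to a lower test-set value, as in the figures quoted above, indicates that the model explains less of the variance in compounds it was not fitted to.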

