
Computer-Aided Drug Design for Molecular Property Prediction

Advisor: 曾宇鳳

Abstract


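Fingerprint-based, feature-based, and molecular graph-based representations have all been used in other studies with different deep learning methods to predict molecular properties. Different molecular representations have been clearly shown to affect model prediction and interpretability. We review different molecular representation methods and focus on building deep learning models with graph and line representations. Typically, a single fixed canonical chemical structure is used to represent a molecule when its properties are computed. We carefully examined the simplified molecular input line entry specification (SMILES) notation used to represent a single molecule and propose using the full enumeration of SMILES to achieve higher prediction accuracy. We built the model with a convolutional neural network (CNN). Full SMILES enumeration improves how a molecule is presented to the model and describes the molecule from all possible angles. A CNN model trained this way is very robust on large datasets because no additional chemical knowledge is needed to predict solubility. In addition, it has traditionally been difficult to use neural networks to explain how chemical structure contributes to a single property. We show that an attention mechanism in the decoding network can detect the parts of a molecule related to solubility, which explains, from the CNN model, the effect of chemical structure on the predicted property. The key to generating the best deep learning model for predicting molecular properties is to test and apply various optimization methods. Although individual optimization methods from studies outside the pharmaceutical field have each successfully improved model performance, applying these methods carefully and in specific combinations may improve performance further. We use and discuss three high-performance optimization methods from the literature, which have been shown to significantly improve model performance in other fields. We ultimately arrive at a general procedure for training better-optimized CNN models for different molecular properties. The three techniques are a dynamic batch size strategy adjusted to different enumeration ratios of the compounds' SMILES representations, Bayesian optimization for selecting model hyperparameters, and prediction from features obtained by a feedforward neural network that takes chemical features as input, combined with the molecular feature vector learned by the CNN. In total, we used seven different molecular properties (water solubility, lipophilicity, hydration energy, electronic properties, blood–brain barrier permeability, and inhibition). We demonstrate how each of the three optimization techniques affects the model, and that the best model benefits from combining Bayesian optimization with dynamic batch size tuning.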

Parallel Abstract


Fingerprint-based, feature-based, and molecular graph-based representations have all been used with different deep learning methods to predict molecular properties. It has been clearly demonstrated that different molecular representations affect model prediction and explainability. We review different representations and focus on graph and line notations for modelling. In general, one canonical chemical structure is used to represent one molecule when computing its properties. We carefully examined the commonly used simplified molecular input line entry specification (SMILES) notation for representing a single molecule and propose using the full enumeration of SMILES to achieve better accuracy. A convolutional neural network (CNN) was used to build the model. The full enumeration of SMILES improves the presentation of a molecule and describes the molecule from all possible angles. The resulting CNN model can be very robust when dealing with large datasets, since no additional explicit chemistry knowledge is necessary to predict solubility. Traditionally, it has also been hard to use a neural network to explain the contribution of chemical substructures to a single property. We demonstrate the use of attention in the decoding network to detect the parts of a molecule that are relevant to solubility, which can be used to explain how chemical structure contributes to the CNN's prediction. The key to generating the best deep learning model for predicting molecular properties is to test and apply various optimization methods. While individual optimization methods from past work outside the pharmaceutical domain have each succeeded in improving model performance, greater improvement may be achieved when specific combinations of these methods and practices are applied.
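The idea of writing one molecule as many different SMILES strings can be illustrated with a small, self-contained sketch. It is not the thesis's implementation: it assumes a toy molecule given as a list of atom symbols and a list of single bonds, handles only acyclic molecules, and ignores charges, stereochemistry, ring closures, and bond orders. The function name `enumerate_smiles` and its input format are hypothetical.

```python
from itertools import permutations, product


def enumerate_smiles(atoms, bonds):
    """Return all SMILES strings for an acyclic, single-bonded toy molecule.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) index pairs, each a single bond
    Assumption: no rings, charges, stereo, or multiple bonds.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    def write(atom, parent):
        # Every way to write the subtree rooted at `atom`.
        children = [n for n in neighbors[atom] if n != parent]
        if not children:
            return {atoms[atom]}
        results = set()
        for order in permutations(children):
            options = [write(child, atom) for child in order]
            for combo in product(*options):
                # All but the last child become parenthesised branches.
                branches = "".join(f"({s})" for s in combo[:-1])
                results.add(atoms[atom] + branches + combo[-1])
        return results

    all_smiles = set()
    for start in range(len(atoms)):  # every atom can be the first atom written
        all_smiles |= write(start, None)
    return sorted(all_smiles)


# Ethanol: C-C-O
ethanol = enumerate_smiles(["C", "C", "O"], [(0, 1), (1, 2)])
print(ethanol)  # ['C(C)O', 'C(O)C', 'CCO', 'OCC']
```

Each string describes the same molecule starting from a different atom or branch order, which is the sense in which enumeration shows a model the molecule from several angles. A cheminformatics toolkit would be needed for real molecules with rings and stereochemistry.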
Three high-performance optimization methods from the literature, each shown to dramatically improve model performance in other fields, are applied and discussed, leading to a general procedure for generating optimized CNN models for different molecular properties. The three techniques are a dynamic batch size strategy for different enumeration ratios of the SMILES representation of compounds, Bayesian optimization for selecting the hyperparameters of a model, and feature learning in which chemical features obtained by a feedforward neural network are concatenated with the learned molecular feature vector. A total of seven molecular properties are used, covering water solubility, lipophilicity, hydration energy, electronic properties, blood–brain barrier permeability, and inhibition. We demonstrate how each of the three techniques affects the model and how the best model generally benefits from Bayesian optimization combined with dynamic batch size tuning.
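Two of the three techniques can be sketched in a few lines. The abstract does not state the actual batch size rule or vector dimensions, so everything below is an assumption made for illustration: the linear scaling rule, the cap of 1024, and the function names `dynamic_batch_size` and `concatenate_features` are all hypothetical.

```python
def dynamic_batch_size(base_batch_size, enumeration_ratio, max_batch_size=1024):
    """Assumed rule: grow the batch linearly with the number of SMILES
    strings generated per molecule, up to a fixed cap.

    The thesis's real schedule may differ; this only shows the idea of
    tying batch size to the enumeration ratio.
    """
    if enumeration_ratio < 1:
        raise ValueError("enumeration_ratio must be at least 1")
    return min(base_batch_size * enumeration_ratio, max_batch_size)


def concatenate_features(cnn_vector, descriptor_vector):
    """Join the CNN-learned molecular vector with the vector produced by a
    feedforward network from chemical features, giving one input for the
    final prediction layer."""
    return list(cnn_vector) + list(descriptor_vector)


# With 4 SMILES per molecule, a base batch of 32 becomes 128.
print(dynamic_batch_size(32, 4))
# A 2-dimensional CNN vector joined with a 1-dimensional descriptor vector.
print(concatenate_features([0.1, 0.2], [1.0]))
```

Scaling the batch with the enumeration ratio keeps the number of distinct molecules per batch roughly constant as more SMILES strings per molecule are added, which is one plausible motivation for such a strategy.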

