
Predicting Plausible Relations of Drugs Using Machine Learning with Language Models

Advisor: 許永真 (Jane Yung-jen Hsu)

Abstract


This thesis investigates whether a language model can capture latent knowledge about drugs and predict plausible drug-drug interactions. Recent research has shown that word embeddings encode latent knowledge that can be used to discover new relations; building on this, we train a language model on biomedical knowledge to predict plausible drug information and propose a modified pre-training objective that better captures drug knowledge. Adverse drug reactions cost up to US$136 billion a year, and their leading cause is unexpected drug-drug interactions arising from polypharmacy. Because it is impractical to run experiments on every combination of approved drugs, predicting which drug pairs are likely to interact, or screening for the most plausible combinations, is essential. Current machine-learning approaches to this problem fall into two categories: one predicts from drug chemical structures, the other from reports in electronic health records. We aim to predict drug-drug interactions correctly before patients suffer adverse reactions, so we train on medical literature published before the end of 2015 and predict drug interactions discovered between 2016 and 2020. We turn each sentence describing a drug-drug interaction into a cloze test by masking a drug it mentions; over 14,184 test sentences and 3,607 candidate drugs, our top-ten predictions achieve a 39.8% hit rate. Our results may help improve the ability of current learning objectives to gather entity information and broaden the applications of pre-trained models.

Abstract (English)


The main purpose of this thesis is to investigate whether language models can capture latent knowledge of drugs and predict plausible drug-drug interactions. Recent research suggests that word embeddings capture latent knowledge that can be used to induce undiscovered relations. In this thesis, we equip a language model with biomedical knowledge to predict plausible relations and propose a modified pre-training schema to better capture latent knowledge of drugs. Adverse Drug Reactions (ADRs) cost over $136 billion a year, and drug-drug interactions (DDIs) due to polypharmacy have been their major cause. Since it is infeasible to run experiments on all pairs of approved drugs, predicting the most plausible interactions and evaluating their probabilities play an important role in DDI research. To validate the utility of the pre-trained language model, we deploy a cloze test on sentences describing relations discovered after the training cutoff: given a sentence describing a relation between two drugs, we mask out one drug at a time and have the model predict it from the other. The experiments show a Hits@10 of 39.8% on 14,184 testing sentences over 3,607 candidate drugs. Our results may help improve how current learning objectives leverage entity information and extend the use of pre-trained language models.
