Automated Extraction of Patterns for Disease-Drug Relationships in Biomedical Literature

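This study attempts to determine the degree of association between human diseases and drugs from the biomedical literature and to derive rules or relationships between them. If the relatedness of diseases and drugs can be predicted automatically from the literature, biomedical researchers surveying publications on diseases and drugs can use these associations to understand disease-drug relationships quickly, which saves labor and time and speeds up the development of biomedicine. The data used in this study are pairings of diseases and drugs from completed studies listed on the official United States website Clinical trials (https://clinicaltrials.gov/), together with biomedical abstracts from the PubMed database (https://www.ncbi.nlm.nih.gov/pubmed/). In this thesis, we first locate, in PubMed abstracts, sentences that contain a disease and a drug paired in Clinical trials and treat them as positive sentences; sentences that contain the same disease but a different drug are treated as negative sentences. Two models are used, one for sentences in which the disease precedes the drug and one for sentences in which the drug precedes the disease, in order to analyze the verbs, nouns, and related information that occur between the disease and the drug. These words are divided into purely associated words, purely non-associated words, and mixed words. A chi-square test is then applied to reclassify the neutral words that meet a threshold, which yields pattern rules for disease-drug relationships. Finally, the pattern rules are matched against test data and evaluated. The best experimental result in this study is a precision of 100%, a recall of 89%, and an F-score of 94%.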

Keywords

disease-drug association; pattern extraction; biomedical literature; chi-square test

Parallel Abstract

The objectives of this study are to identify associations between human diseases and medications in the biomedical literature and to find rules or relationships between human diseases and drugs. If such associations can be identified automatically from the literature, biomedical researchers who study publications on diseases and medications can use them to understand disease-drug relationships and to collect information more efficiently. This would both save labor and time and accelerate the development of biomedical science. The data in this study come from disease-drug pairs in completed studies listed on the official United States website Clinical trials (https://clinicaltrials.gov/) and from biomedical abstracts in PubMed (https://www.ncbi.nlm.nih.gov/pubmed/). In this thesis, we first search PubMed abstracts for sentences that contain a disease and a drug paired on the Clinical trials website and label these sentences as positive. We then find sentences that contain the same disease but a different medication and label these sentences as negative. To analyze the verbs and nouns that occur between the disease and the medication, two models for different sentence structures are established. The first model covers sentences in which the disease mention precedes the medication mention. The second model covers sentences in the reverse order. The words are then classified into three categories: pure association, pure no association, and neutral. Among the neutral words, those that meet a threshold are further classified by the chi-square test. The resulting associations between diseases and medications are called patterns. Finally, the patterns are applied to test data to extract disease-drug pairs.
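The word classification step described above can be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the function names, the label strings, the per-sentence word sets, the 2x2 contingency layout, the critical value 3.841 (one degree of freedom, p = 0.05), and the rule that a significant word takes the label of the class in which it is relatively more frequent are all assumptions, since the abstract does not state them.

```python
from collections import Counter

# Assumed threshold: chi-square critical value for 1 degree of freedom at p = 0.05.
CHI2_CRITICAL_05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def classify_words(pos_sentences, neg_sentences, threshold=CHI2_CRITICAL_05):
    """Label each word as 'association', 'no_association', or 'neutral'.

    Each sentence is given as the list of words found between the disease
    mention and the drug mention (hypothetical input format).
    """
    # Count in how many sentences of each class a word appears.
    pos_counts = Counter(w for s in pos_sentences for w in set(s))
    neg_counts = Counter(w for s in neg_sentences for w in set(s))
    n_pos, n_neg = len(pos_sentences), len(neg_sentences)

    labels = {}
    for word in set(pos_counts) | set(neg_counts):
        a, c = pos_counts[word], neg_counts[word]
        if c == 0:
            labels[word] = "association"       # seen only in positive sentences
        elif a == 0:
            labels[word] = "no_association"    # seen only in negative sentences
        else:
            # Mixed word: test whether its presence depends on the class.
            chi2 = chi_square_2x2(a, n_pos - a, c, n_neg - c)
            if chi2 < threshold:
                labels[word] = "neutral"
            elif a / n_pos > c / n_neg:
                labels[word] = "association"
            else:
                labels[word] = "no_association"
    return labels


pos = [["treat", "with", "and"]] * 5 + [["treat", "with"]] * 4 + [["treat"]]
neg = [["fail", "with", "and"]] + [["fail", "and"]] * 4 + [["fail"]] * 5
labels = classify_words(pos, neg)
print(labels["treat"], labels["fail"], labels["with"], labels["and"])
```

In this toy data, "with" appears in 9 of 10 positive sentences and 1 of 10 negative ones, so its statistic exceeds the assumed threshold and it is reassigned, while "and" is evenly split and stays neutral. Running one such classification per sentence order (disease first, drug first) would mirror the two models mentioned above.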
The best experimental result shows a precision of 100%, a recall of 89%, and an F-score of 94%.
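As a consistency check on the reported figures, and assuming the F-score here is the balanced F1 measure (the harmonic mean of precision P and recall R, which the abstract does not state explicitly), the numbers work out as:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 1.00 \times 0.89}{1.00 + 0.89}
    = \frac{1.78}{1.89}
    \approx 0.94
```

This agrees with the reported F-score of 94%.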

Parallel Keywords

disease-drug association; pattern extraction; biomedical literature; chi-square test
