  • Thesis

生物醫學名詞關聯擷取及資料庫系統之應用

Biomedical Named Entity Relation Extraction and Its Application to Database System

Advisor: 黃火鍊
Co-advisor: 許聞廉

Abstract


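With the large volume of scientific literature published over the past two decades, systematically processing and analyzing the biomedical named entities it contains has become an important field. This thesis focuses on extracting relations between genes and specific diseases and on applying the results to a database system. Ischaemic heart disease (IHD) has been ranked by the World Health Organization as the first of the world's ten leading causes of death, and it is especially prevalent in middle- and high-income countries. The most common cause of IHD is coronary arteriosclerosis; other risk factors include hyperlipidemia and gene polymorphisms. Conventional therapies have been shown to have many problems, in particular limited effectiveness in people with metabolic disease, so many studies have begun to look for other treatments. Several papers have suggested that gene therapy may be a new direction, yet research that uses text mining to broadly identify IHD disease genes and to develop related databases is still lacking. We therefore chose IHD as our research target and used automated text mining techniques to analyze more than ten thousand biomedical articles and extract relations between genes and IHD. The main mining steps are gene mention recognition, gene name normalization, disease mention recognition, corpus annotation, and model training. The results are built into a database system that also integrates other related information, such as protein-protein interactions and single-nucleotide polymorphisms (SNPs). In evaluation, our relation extraction model performed satisfactorily, reaching an F-score of 80.81%. In addition, a comparison with other manually constructed databases indicates that text mining based on automatic machine learning not only reaches a reasonable level of accuracy but also finds far more candidate disease genes; about 700 disease gene records are currently stored in the database. Furthermore, the text mining results and the integrated information can help researchers understand and infer other possible disease genes and related mechanisms. This is the first study to analyze genes and IHD through text mining and to present their relations through a database system. In the future, we hope to build on this approach, adding domain-specific knowledge, to analyze more specific diseases, and to improve the system's performance in all respects as the related tools improve.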

Parallel Abstract


As the scientific literature has grown exponentially over the past 20 years, systematically processing and analyzing the biomedical named entities it contains has become an important field. In this thesis, we focus on relation extraction for genes and specific diseases and on its application to a database system. Ischaemic heart disease (IHD), the most common type of heart disease, is ranked by the World Health Organization as the first of the world's ten leading causes of death, especially in middle- and high-income countries. Coronary arteriosclerosis is the most frequent cause of IHD; other risk factors include hyperlipidemia and gene polymorphisms. Classic therapeutics for IHD are less effective in individuals with metabolic syndrome. Several studies have shown that gene therapy may provide a novel means of IHD treatment. However, there is little research on comprehensively discovering gene-IHD relations through text mining or on developing related databases. Therefore, we use contemporary text mining technologies to find genes related to IHD in the literature, and we develop a database system that provides convenient and accurate access to the extracted information and to integrated data such as protein-protein interactions and single-nucleotide polymorphisms. The test results indicate that the relation extraction model achieves a satisfactory F-score of 80.81% for IHD. In addition, comparisons with other manually constructed databases show that our text mining approach, based on automatic machine learning, not only reaches a reasonable level of accuracy but also identifies far more candidate disease genes. At present, about 700 candidate genes for IHD are available in our database. Furthermore, the text mining results and the information integrated on the website may help researchers understand and infer putative disease genes and the mechanisms contributing to the disease.
This is the first study to focus on gene-IHD relation extraction through automatic literature text mining and to develop a database system for presenting the results. In the future, we would like to analyze other specific diseases using a similar approach together with the corresponding domain knowledge. We also want to adopt better tools to enhance the system's performance.
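
The abstract reports the model's performance as an F-score but does not say how it was computed. As a minimal sketch, assuming the standard balanced F-score (the harmonic mean of precision and recall), the calculation could look like the following. The function name `f_score` and the counts in the example are hypothetical illustrations, not the thesis's evaluation data.

```python
def f_score(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from raw counts.

    tp: predicted gene-disease relations that are correct
    fp: predicted relations that are wrong
    fn: true relations the model missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical counts, chosen only to illustrate the arithmetic.
precision, recall, f1 = f_score(tp=80, fp=20, fn=20)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

With `beta=1.0` precision and recall are weighted equally; a different `beta` would shift the weight toward recall (`beta > 1`) or precision (`beta < 1`).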
