Rule Extraction for Relationships between Human Genes and Diseases

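Across the large body of biomedical literature on human genetic diseases, researchers have tried various methods to quantify how strongly a disease is associated with a gene, and to derive rules or correlations that explain the relationship between the two. If such a method works, its rules and computations can be applied to the large volume of newly published literature to identify disease-gene relationships automatically. This helps readers and saves time, and researchers hope it will speed up biomedical research and lead to treatments for these diseases sooner. The method used in this thesis is summarized as follows. Our data come from the Medical Literature Analysis and Retrieval System Online (MEDLINE). First, we extract the required fields from MEDLINE: TI, the title, and AB, the abstract text. Second, we use the morbid gold-standard data provided by Online Mendelian Inheritance in Man (OMIM) to identify the correct sentences, that is, sentences in which a genetic disease and a gene are actually related. Third, we use the Memory-Based Shallow Parser (MBSP) to parse these correct sentences, together with randomly selected incorrect sentences, to obtain part-of-speech information. We then use the ALEPH system, which follows the ILP framework, to learn rules. The ILP framework has three elements: a hypothesis H, background knowledge B, and examples E; given B and E, H can be derived. From the learned rules, we propose several scoring calculations and test them experimentally to select the better rules. For evaluation, we apply the selected rules to find related diseases and genes, and use precision and recall as the evaluation criteria. The experimental results show a best F-score of 66.9%, with a precision of 70.6% and a recall of 63.5%.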
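The first step, extracting TI and AB from MEDLINE, can be sketched as follows. This is a minimal sketch that assumes the records are in the plain-text tagged MEDLINE export format (a tag, a hyphen, then the value, with continuation lines indented by six spaces). The function names and the sample record are hypothetical illustrations, not taken from the thesis.

```python
import re

# Assumed layout: tag lines look like "TI  - text";
# continuation lines of the same field start with six spaces.
_TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*- (.*)$")


def parse_medline_record(text):
    """Parse one tagged record into a dict mapping tag -> list of values."""
    fields = {}
    current = None
    for line in text.splitlines():
        match = _TAG_LINE.match(line)
        if match:
            current = match.group(1)
            fields.setdefault(current, []).append(match.group(2).strip())
        elif current is not None and line.startswith("      "):
            # Continuation of the most recent field value.
            fields[current][-1] += " " + line.strip()
    return fields


def title_and_abstract(text):
    """Return the (TI, AB) pair of a record; missing fields become ''."""
    fields = parse_medline_record(text)
    title = " ".join(fields.get("TI", []))
    abstract = " ".join(fields.get("AB", []))
    return title, abstract


# Hypothetical record, made up for illustration only.
SAMPLE_RECORD = "\n".join([
    "PMID- 0000000",
    "TI  - A hypothetical study of gene X",
    "      in disease Y.",
    "AB  - Gene X was examined. It may be",
    "      related to disease Y.",
])

if __name__ == "__main__":
    print(title_and_abstract(SAMPLE_RECORD))
```

Splitting the AB text into sentences and matching them against OMIM disease-gene pairs would follow this step; those parts are not sketched here because the abstract does not describe how they were done.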

Keywords

rule extraction; rule learning; gene-disease relationship; biomedical text mining

Parallel Abstract

In the biomedical literature on human genetic diseases, researchers have tried different methods to find rules or relations between human genetic diseases and genes. If such methods work well, the resulting rules can be used to find relations in further biomedical literature faster and more easily. Researchers expect these methods to speed up progress in the biomedical domain and, in turn, to help find ways to cure these diseases. We used data provided by the Medical Literature Analysis and Retrieval System Online (MEDLINE). First, we retrieved the required information from MEDLINE, namely TI and AB, where TI is the title and AB is the abstract. Second, we used the morbid data provided by Online Mendelian Inheritance in Man (OMIM) to find the correct sentences about human genetic diseases and genes, and also picked incorrect sentences at random. Third, we used the Memory-Based Shallow Parser (MBSP) to parse these sentences and obtain part-of-speech and other information. Finally, we used the ALEPH system with the above information to learn rules. ALEPH is an inductive logic programming (ILP) system. The ILP framework contains three elements: a hypothesis H, background knowledge B, and examples E. Given B and E, we can infer H, which corresponds to the rules in our experiment. We proposed several scoring calculations to select better rules, and then used these rules to find sentences that relate human genetic diseases and genes. We used precision, recall, and F-score as our evaluation metrics. The experimental results showed a best F-score of 66.9%, with a precision of 70.6% and a recall of 63.5%.
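The three reported figures can be checked against each other. The sketch below assumes the F-score is the balanced F1, the harmonic mean of precision and recall; the abstract does not state the formula, so this is an assumption, and the function name is illustrative.

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Precision and recall as reported in the abstract.
    print(f"{f_score(0.706, 0.635):.3f}")
```

With a precision of 0.706 and a recall of 0.635, this gives about 0.669, which agrees with the reported best F-score of 66.9%.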

Parallel Keywords

rule extraction; rule learning; gene-disease relationship; biomedical text mining

