
生物文件基因標示之研究

A Study on Gene Annotation from Biological Literature

Advisor: Hsin-Hsi Chen (陳信希)

Abstract


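To help biologists quickly and effectively find information about the genes they are interested in from the rapidly growing body of online material, this dissertation investigates several issues in gene annotation from biological literature: improving the performance of gene recognition, classifying the documents of interest to database curators, extracting gene functions, and annotating Gene Ontology terms. We propose a solution for each issue so that gene annotations can be integrated into databases.

Recognizing genes is the most basic step in extracting gene information from biological documents. To improve gene recognition performance, we propose a hybrid strategy that combines a filtering strategy and an integration strategy. In our experiments, we automatically mine the words that frequently occur with genes and use them to filter out unlikely gene candidates. In addition, we use the integration strategy to raise recall. The experiments show that this hybrid strategy improves on the original recognition performance. Once genes can be tagged correctly, another important issue is retrieving the documents of interest to database curators. This dissertation explores classification at different levels of granularity for the Gene Ontology. We use three parts of an article, (1) the title and abstract, (2) MeSH terms, and (3) captions of figures and tables, together with the semantic network of UMLS, as features to train an SVM; the experimental results show quite good performance. Extracting gene functions, in turn, is an important way to understand genes. Currently, gene functions can be recorded manually through the GeneRIF entries of the Entrez Gene database. We propose two methods, (1) a function extraction method and (2) a machine learning method, to generate GeneRIFs automatically from documents, which is a substantial aid to the traditional manual submission of GeneRIFs.

Finally, to integrate the information in documents into the database, we represent the properties of genes with a standard vocabulary. This dissertation adopts a standard vocabulary that is currently in wide use: Gene Ontology (GO). In research on GO annotation, researchers typically annotate at one of two levels: (1) the document level and (2) the gene level. The former annotates the GO terms that an article carries, while the latter states more explicitly which gene in the article carries each GO annotation. This dissertation explores both levels. At the document level we propose a relevance detection method, and at the gene level we propose a density model and a gravitation model. At the document level, we treat the previously extracted GeneRIFs as evidence for GO annotation that database curators can consult. At the gene level, besides using the proximity between genes and GO terms, we also apply Newton's law of universal gravitation. The experimental results show that density and gravitation relationships are both good features for GO annotation.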

Parallel Abstract


In this dissertation we study several issues that help biologists obtain relevant gene information from the rapidly growing body of online material in the biomedical field. Specifically, our studies aim to improve the performance of biomedical named entity recognition, to classify the relevant documents for database curators, and to improve gene function extraction and Gene Ontology annotation. We propose an approach for each issue. Our final goal is to integrate the information extracted from the biological literature into existing databases.

Given biomedical documents, the first fundamental step is to recognize the biomedical entities. To improve the performance of biomedical entity recognition, we introduce a hybrid strategy that combines a filtering strategy and an integration strategy. We present a fully automatic method for mining collocates from scientific texts in the protein and gene domain and apply the collocates to filter out unlikely protein/gene candidates. Furthermore, we use the integration strategy to increase recall. The experimental results demonstrate that this hybrid strategy performs better than the original protein/gene taggers. After biomedical entities are recognized, another important issue is to retrieve the relevant documents for database curators so that this information can be added to the existing database. The dissertation also investigates different granularities of classification for GO annotation. We use three parts of an article, i.e., (1) titles and abstracts, (2) MeSH terms, and (3) captions of tables and figures, as well as the semantic network of UMLS, as features for an SVM. Evaluation results demonstrate high overall performance. Thirdly, gene function extraction is essential for biologists to understand genes. Currently, researchers can manually submit GeneRIFs to the Entrez Gene database. We propose two approaches, a "function extraction approach" and a "machine learning approach," to automatically extract GeneRIFs from the curatable documents identified in the previous step. The experimental results are promising.

Finally, in order to integrate the extracted information into the database, it is necessary to describe genes with a standard vocabulary. We use a highly popular controlled vocabulary, Gene Ontology (GO), in this dissertation. Researchers usually perform GO annotation at one of two levels, i.e., the "document level" and the "gene level." The former annotates the GO terms in the document without identifying the relevant genes, while the latter explicitly identifies the association among genes, GO terms, and documents. This dissertation explores GO annotation at both levels. At the document level, we annotate GO terms using a relevance detection approach. At the gene level, we introduce density and gravitation models. Moreover, we use the GeneRIFs extracted in the previous stage as references for annotating GO terms at the document level, which should be of great help to database curators. In addition, we explore the proximity of genes and GO terms within a paragraph at the gene level. Our experiments show that density and gravitation relationships are good features for GO annotation.
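The abstract does not give the formula behind the gravitation model, so the following is only a minimal sketch of what a gravitation-style proximity feature could look like. It assumes that each mention of a gene or GO term in a paragraph has unit "mass", that distance is the difference in token offsets, and that the attractions of all mention pairs are summed. The function name `gravitation_score` and all of these choices are hypothetical illustrations, not the dissertation's actual model.

```python
from typing import Sequence


def gravitation_score(gene_positions: Sequence[int],
                      go_positions: Sequence[int]) -> float:
    """Sum an inverse-square attraction over all (gene, GO term) mention pairs.

    Assumptions (not taken from the dissertation): every mention has unit
    mass, and distance is the absolute difference of token offsets within
    one paragraph.
    """
    score = 0.0
    for g in gene_positions:
        for t in go_positions:
            d = abs(g - t)
            if d == 0:
                # Overlapping offsets would divide by zero; skip them.
                continue
            score += 1.0 / (d * d)
    return score


if __name__ == "__main__":
    # A gene mentioned at tokens 0 and 4, a GO term mentioned at token 2.
    print(gravitation_score([0, 4], [2]))
```

Under these assumptions, a GO term mentioned close to a gene contributes far more than one mentioned at the other end of the paragraph, which is the kind of proximity signal the abstract describes as a feature for gene-level annotation.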

