
Representation Learning for Biomedical Relation Extraction: Distributional Similarity Needs More Information

Advisor: 魏志平

Abstract


Relation extraction is the task of automatically identifying the relation between entities in a text segment. Neural network models have been widely used for relation extraction in recent years. However, having domain experts annotate relations between entities is expensive in the biomedical domain, so we further explore a self-supervised learning method that obtains supervisory signals from unlabeled data. Matching the Blanks (MTB) (Soares et al., 2019) is a self-supervised learning method that learns the relation representation between two entities on the assumption that when two sentences contain the same pair of entities, they are likely to express semantically similar relations, so their vector representations in the embedding space should be close to each other (that is, their distributional similarity should be high), and vice versa. Chu (2020) further improved MTB in two ways: one is the incorporation of harder negative examples, i.e., different entity pairs drawn from the same sentence; the other is the addition of dependency-path information to the model. On the other hand, the biomedical relation extraction models proposed by Kuo (2019) and Chang (2020), referred to as PCNN-GT and KG-PCNN-GT respectively, have empirically shown that additional domain-specific features can improve the performance of relation extraction in the biomedical domain. We therefore incorporate domain-specific features into our proposed self-supervised learning method for learning the relation representation of a sentence and the two entities appearing in it. Because Chu (2020) reported satisfactory effectiveness in extracting biomedical relation statement representations, whereas the model proposed by Chang (2020) is computationally inefficient, we build on Chu's (2020) method and propose two novel models, referred to as A-MIGRATE and A-MIGRATE-Dep. The proposed models incorporate four types of features: entity features, context features, GT features, and dependency features. Our experiments demonstrate the effectiveness of all four feature types and lead to the conclusion that different types of features contribute different information to relation statement representation learning. Furthermore, our models are competitive with, and even outperform, relation classification models trained only on labeled data, such as PCNN-GT and KG-PCNN-GT, and they significantly outperform all the other benchmark methods.
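To make the training signal described above concrete, the following is a minimal PyTorch sketch of the two ideas the abstract combines: fusing the four feature types into a relation-statement embedding, and MTB's binary matching objective over statement pairs. This is a sketch under stated assumptions, not the A-MIGRATE implementation: the module name, all dimensions, the linear fusion layer, and the dot-product similarity are illustrative, and each feature type is assumed to be pre-encoded as a fixed-size vector ("GT" is treated as an opaque domain-specific feature, since the abstract does not expand the acronym).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationStatementEncoder(nn.Module):
    """Fuse the four feature types into one relation-statement embedding.
    Illustrative sketch; names and dimensions are not from the thesis."""

    def __init__(self, d_entity: int, d_context: int, d_gt: int, d_dep: int,
                 d_out: int = 256):
        super().__init__()
        self.proj = nn.Linear(d_entity + d_context + d_gt + d_dep, d_out)

    def forward(self, entity_feat, context_feat, gt_feat, dep_feat):
        # Concatenate entity, context, GT, and dependency features, then
        # project into the embedding space used for the matching objective.
        fused = torch.cat([entity_feat, context_feat, gt_feat, dep_feat], dim=-1)
        return self.proj(fused)


def mtb_matching_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                      same_entity_pair: torch.Tensor) -> torch.Tensor:
    """MTB-style self-supervised objective (after Soares et al., 2019):
    statements mentioning the same entity pair are treated as likely
    positives and pushed together; statements with different entity pairs,
    including the harder negatives drawn from the same sentence (Chu, 2020),
    are pushed apart."""
    logits = (emb_a * emb_b).sum(dim=-1)  # dot-product similarity
    return F.binary_cross_entropy_with_logits(logits, same_entity_pair)


# Toy usage with random features for a batch of 8 statement pairs.
if __name__ == "__main__":
    enc = RelationStatementEncoder(d_entity=64, d_context=128, d_gt=32, d_dep=32)
    a = enc(torch.randn(8, 64), torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 32))
    b = enc(torch.randn(8, 64), torch.randn(8, 128), torch.randn(8, 32), torch.randn(8, 32))
    labels = torch.randint(0, 2, (8,)).float()  # 1.0 = same entity pair
    print(mtb_matching_loss(a, b, labels))
```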

References


Bodenreider, O. (2004). The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research, 32(suppl 1):D267–D270.
Cardie, C. (1997). Empirical methods in information extraction. AI Magazine, 18(4):65–79.
Chang, C.-H. (2020). Biomedical relation extraction supporting by knowledge graph embedding model. Unpublished Master Thesis, Department of Information Management, National Taiwan University, Taiwan, ROC.
Chu, Y.-C. (2020). Representation learning for biomedical relation extraction with dependency parsing. Unpublished Master Thesis, Department of Information Management, National Taiwan University, Taiwan, ROC.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
