  • Thesis

利用依存句法於生物醫學關係萃取之表示學習

Representation Learning for Biomedical Relation Extraction with Dependency Parsing

Advisor: 魏志平

Abstract


Relation extraction is the task of automatically learning and extracting the relation between two entities from text. In recent years, neural network models have been widely applied to relation extraction and have achieved excellent performance. However, neural networks require large amounts of training data, and in the biomedical domain annotation is expensive and large labeled datasets are scarce. We therefore explore a self-supervised learning approach that needs only a small amount of labeled data to fine-tune the model. MTB is a relation extraction model trained with self-supervision: under the assumption that the same entity pair appearing in different sentences is likely to express the same relation, MTB learns a vector representation of the relation between any two entities. Unlike many earlier deep learning relation extraction models, MTB uses no additional natural language features, so we conjecture that adding the dependency-path information between the two entities can help MTB train better. Moreover, because MTB is trained only on whether two entity pairs are identical, the selection of negative samples (non-identical entity pairs) is particularly important; we argue that beyond the two kinds of negative samples proposed by MTB, there exist negatives that make training more effective.

Based on the MTB model, we therefore propose two directions for improvement: (1) encoding and embedding the dependency relation between an entity pair with four neural network modules, and (2) inline negative samples, which prevent the MTB model from merely learning keyword matching and force it to learn truly context-based relation representations. Experiments under various settings show that, relative to the original MTB architecture, both proposed improvements effectively increase relation extraction performance. We further explore which dependency modules are better suited to simple versus complex syntactic relations, and show that even under finer-grained directional relations our model can still discriminate effectively and outperform the original MTB architecture.
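The extra feature the abstract highlights is the dependency path between the two entities of a pair. As a minimal, library-free sketch of that idea (the tokens and head indices below are a hand-made toy parse, not output of the parser used in the thesis), the shortest dependency path can be found by breadth-first search over the parse tree treated as an undirected graph:

```python
from collections import deque

def shortest_dependency_path(heads, start, end):
    """Shortest path between two token indices in a dependency tree,
    where heads[i] is the head index of token i (the root points to itself)."""
    # Build an undirected adjacency list from the head pointers.
    n = len(heads)
    adj = {i: set() for i in range(n)}
    for i, h in enumerate(heads):
        if h != i:                      # skip the root's self-loop
            adj[i].add(h)
            adj[h].add(i)
    # Breadth-first search from start to end.
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in adj[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy parse of "Aspirin inhibits COX enzymes":
tokens = ["Aspirin", "inhibits", "COX", "enzymes"]
heads  = [1, 1, 3, 1]   # "inhibits" is the root (points to itself)
path = shortest_dependency_path(heads, 0, 3)
print([tokens[i] for i in path])  # ['Aspirin', 'inhibits', 'enzymes']
```

The path tokens (here skipping the modifier "COX") are the kind of compact syntactic signal that a dependency-encoding module could consume alongside the raw sentence.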

English Abstract


Relation extraction is the task of learning and extracting relations between entities from text. In recent years, neural network models have been widely used in relation extraction and have achieved state-of-the-art performance. However, neural networks require a large amount of training data. In the biomedical domain, labeled instances are expensive to acquire and training datasets are often small, so we explore self-supervised learning methods that require only a small amount of labeled data for fine-tuning the model. Matching the Blanks (MTB) is a self-supervised relation extraction model. Under the assumption that two sentences containing the same entity pair are likely to express the same relation, MTB can learn a vector representation of the relation between any two entities. Unlike many earlier deep learning relation extraction models, however, MTB uses no natural language features beyond the text itself. We therefore believe that taking the dependency-parse information between the two entities in a sentence into account gives MTB an opportunity to train better. In addition, since negative samples play an important role in MTB training, the selection of negative samples (non-identical entity pairs) is particularly important, and we believe there exists a new type of negative sample that is more effective for MTB training. Based on the MTB model, we therefore propose two directions for improvement: (1) four neural network modules that encode the dependency relationship between entities, and (2) inline negative samples, with which the MTB model cannot simply learn keyword matching but must truly learn context-based relation representations. Across various experimental settings, we show that both proposed improvements increase the effectiveness of relation extraction compared with the original MTB architecture.
We also explore which dependency modules are more suitable for simple or complex dependency relationships of an entity pair, and show that under more fine-grained directional relations our model can still discriminate effectively and outperform the original MTB architecture.
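The MTB-style objective summarized above can be sketched as a binary matching loss over relation vectors: the anchor should score high against a positive (same entity pair in another sentence) and low against every negative. Everything below (function name, toy vectors, the flat list of negatives) is illustrative, not the thesis's actual implementation:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mtb_matching_loss(anchor, positive, negatives):
    """Binary 'matching the blanks'-style objective sketch.

    The anchor relation vector should score high against the positive
    (same entity pair seen elsewhere) and low against every negative.
    An inline negative would be built from the same sentence with one
    entity swapped, so surface keyword overlap alone cannot solve it.
    """
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    labels = [1.0] + [0.0] * len(negatives)
    # Sigmoid cross-entropy over each (anchor, candidate) pair.
    loss = 0.0
    for s, y in zip(scores, labels):
        p = 1.0 / (1.0 + math.exp(-s))
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(scores)

# Hypothetical relation vectors: the anchor matches the positive,
# not the two negatives.
loss = mtb_matching_loss([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0], [0.0, -1.0]])
print(round(loss, 4))  # 0.3157
```

Under this framing, the thesis's inline negatives simply enter the `negatives` list; their value is that they share most surface tokens with the anchor sentence, so a low score on them requires genuinely contextual relation representations.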

References


Agichtein, E. and Gravano, L. (2000). Snowball: Extracting relations from large plain-text collections. In Proceedings of the Fifth ACM Conference on Digital Libraries (DL).
Aronson, A. R. and Lang, F.-M. (2010). An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236.
Asahara, M. and Matsumoto, Y. (2003). Japanese named entity extraction with redundant morphological analysis. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pages 8–15.
Baldini Soares, L., FitzGerald, N., Ling, J., and Kwiatkowski, T. (2019). Matching the blanks: Distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2895–2905, Florence, Italy. Association for Computational Linguistics.
Bick, E. (2004). A named entity recognizer for Danish. In Proceedings of the 4th International Conference on Language Resources and Evaluation, pages 305–308.
