Word Sense Disambiguation of Biomedical Abbreviations Using Word Embeddings and Concept Extraction

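English abbreviations are used frequently in hospital medical records and in the medical literature. Because many abbreviations have more than one expansion, their meaning is ambiguous, and word sense disambiguation (WSD) of abbreviations has become an important topic in natural language processing (NLP). In this study, we propose a supervised machine learning method for this problem. First, we use a pre-trained word embedding model and a concept extraction tool for the Unified Medical Language System (UMLS) to build four kinds of features: word embedding features, UMLS concept preferred name features, UMLS concept n-gram features, and part-of-speech features. Next, we choose the support vector machine (SVM) as the learning model. After training and testing on a public dataset from the University of Minnesota (UMN), the best combination of features and parameters yields an accuracy of 97.17% on the full dataset of 75 abbreviations, 96.97% on a subset of 50 abbreviations, and 98.50% on a subset of 13 abbreviations. Compared with the methods used in other papers, the proposed method performs better, which demonstrates its practical value.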

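Keywords

word embedding; concept extraction; word sense disambiguation; natural language processing; machine learning; Unified Medical Language System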

Parallel Abstract

Abbreviations are often used in clinical notes and biomedical articles. Because many of them are ambiguous, identifying the correct expansion of an abbreviation is a vital word sense disambiguation (WSD) task in natural language processing (NLP). In this study, a supervised machine learning solution is proposed for this problem. First, we used a pre-trained word embedding model and a Unified Medical Language System (UMLS) concept extraction tool to construct four kinds of features for target sentences: word embedding features, UMLS concept preferred name features, UMLS concept n-gram features, and part-of-speech features. Next, we chose Support Vector Machines (SVMs) as our machine learning models. After training and testing on a public dataset from the University of Minnesota (UMN), we obtained, with the best feature and SVM parameter settings, an accuracy of 97.17% on the full dataset of 75 abbreviations, 96.97% on a subset of 50 abbreviations, and 98.50% on a subset of 13 abbreviations. These results exceed those of the methods reported by other researchers, which indicates that our solution is effective.
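To make the pipeline above more concrete, the following is a minimal sketch of one part of it: turning the context of an ambiguous abbreviation into a word embedding feature and training a linear SVM on it. It is an illustration under stated assumptions, not the implementation used in this study. The 2-dimensional `EMBEDDINGS` table, the example sentences, the window size, the averaging of context vectors, and the hyperparameters are all hypothetical. The small hinge-loss trainer stands in for an SVM library, and the UMLS concept and part-of-speech features are omitted.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]

# Hypothetical 2-d toy embeddings; a real system would load a pre-trained model.
EMBEDDINGS: Dict[str, Vector] = {
    "joint": [1.0, 0.0], "pain": [0.8, 0.1], "swelling": [0.9, 0.2],
    "heart": [0.0, 1.0], "pressure": [0.1, 0.9], "catheter": [0.2, 0.8],
}
SENSES = {1: "rheumatoid arthritis", -1: "right atrium"}  # toy senses of "RA"


def context_feature(tokens: Sequence[str], target: int, window: int,
                    embeddings: Dict[str, Vector]) -> Vector:
    """Average the embeddings of known words within `window` of the target."""
    dim = len(next(iter(embeddings.values())))
    total, count = [0.0] * dim, 0
    for i in range(max(0, target - window), min(len(tokens), target + window + 1)):
        vec = embeddings.get(tokens[i].lower())
        if i == target or vec is None:
            continue  # skip the abbreviation itself and out-of-vocabulary words
        total = [t + v for t, v in zip(total, vec)]
        count += 1
    return [t / count for t in total] if count else total


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def train_linear_svm(xs: Sequence[Vector], ys: Sequence[int], epochs: int = 300,
                     lr: float = 0.05, lam: float = 0.01) -> Tuple[Vector, float]:
    """Subgradient descent on the L2-regularized hinge loss (labels are +1/-1)."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (dot(w, x) + b)
            w = [wi * (1.0 - lr * lam) for wi in w]  # regularization shrink
            if margin < 1.0:  # point is inside the margin: push it out
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w: Vector, b: float, x: Vector) -> int:
    return 1 if dot(w, x) + b >= 0.0 else -1


def demo() -> List[str]:
    """Train on four toy sentences and disambiguate two unseen ones."""
    train = [
        (["patient", "with", "RA", "reports", "joint", "pain"], 2, 1),
        (["RA", "with", "joint", "swelling"], 0, 1),
        (["catheter", "tip", "in", "RA", "near", "heart"], 3, -1),
        (["elevated", "RA", "pressure"], 1, -1),
    ]
    xs = [context_feature(toks, idx, 3, EMBEDDINGS) for toks, idx, _ in train]
    ys = [label for _, _, label in train]
    w, b = train_linear_svm(xs, ys)
    tests = [
        (["RA", "flare", "with", "swelling", "and", "pain"], 0),
        (["heart", "catheter", "into", "RA"], 3),
    ]
    return [SENSES[predict(w, b, context_feature(toks, idx, 3, EMBEDDINGS))]
            for toks, idx in tests]


if __name__ == "__main__":
    print(demo())
```

In a fuller version, the averaged embedding would presumably be concatenated with the other three feature types before training, and one classifier would be trained per abbreviation; both of these are assumptions about how the pieces fit together rather than details stated in the abstract.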

Parallel Keywords

word embedding; concept extraction; word sense disambiguation; natural language processing; machine learning; UMLS

