Abbreviations are common in clinical notes and biomedical articles. Because many of them have more than one possible expansion, identifying the correct expansion of an abbreviation is an important word sense disambiguation (WSD) task in natural language processing (NLP). In this study, we propose a supervised machine learning solution to this problem. First, we used a pre-trained word embedding model and a Unified Medical Language System (UMLS) concept extraction tool to construct four kinds of features for target sentences: word embedding features, UMLS concept preferred name features, UMLS concept n-gram features, and part-of-speech features. Next, we chose Support Vector Machines (SVMs) as our machine learning models. Trained and tested on a public dataset from the University of Minnesota (UMN), our method achieved, with the best feature combination and SVM parameter settings, an accuracy of 97.17% on the full dataset of 75 abbreviations, 96.97% on a subset of 50 abbreviations, and 98.50% on a subset of 13 abbreviations. These results are better than those reported for the methods used in other studies, which indicates that our solution is effective.
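To make the feature construction more concrete, the following is a minimal sketch of how four feature groups of this kind might be concatenated into a single vector for one target sentence. It is an illustration under stated assumptions, not the implementation used in this study: the toy 2-dimensional embeddings, the context window, the averaging of context vectors, the bag-of-counts encoding, the vocabularies, and all function names are hypothetical. The concept names, concept n-grams, and part-of-speech tags are written by hand here, whereas in practice they would come from a UMLS concept extraction tool and a part-of-speech tagger.

```python
from collections import Counter

# Hypothetical 2-d vectors standing in for a pre-trained word embedding model.
EMBEDDINGS = {
    "brain": [1.0, 0.0],
    "lesions": [0.5, 0.5],
    "pain": [0.0, 1.0],
}
EMBEDDING_DIM = 2


def embedding_features(tokens, target_idx, window=3):
    """Average the embeddings of known words within `window` tokens of the abbreviation."""
    lo = max(0, target_idx - window)
    hi = min(len(tokens), target_idx + window + 1)
    vectors = [
        EMBEDDINGS[tokens[i]]
        for i in range(lo, hi)
        if i != target_idx and tokens[i] in EMBEDDINGS
    ]
    if not vectors:
        return [0.0] * EMBEDDING_DIM
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(EMBEDDING_DIM)]


def bag_features(items, vocabulary):
    """Count how often each vocabulary entry occurs in `items` (fixed order)."""
    counts = Counter(items)
    return [float(counts[term]) for term in vocabulary]


def build_feature_vector(sample, vocabularies):
    """Concatenate the four feature groups into one flat vector."""
    return (
        embedding_features(sample["tokens"], sample["target_idx"])
        + bag_features(sample["concept_names"], vocabularies["concept_names"])
        + bag_features(sample["concept_ngrams"], vocabularies["concept_ngrams"])
        + bag_features(sample["pos_tags"], vocabularies["pos_tags"])
    )


# Hypothetical vocabularies; real ones would be collected from the training data.
VOCABULARIES = {
    "concept_names": ["Brain", "Lesion", "Pain"],
    "concept_ngrams": ["brain", "lesions", "pain"],
    "pos_tags": ["NN", "VBZ", "NNS"],
}

# One hand-annotated sentence containing the ambiguous abbreviation "ms".
sample = {
    "tokens": ["patient", "with", "ms", "has", "brain", "lesions"],
    "target_idx": 2,
    "concept_names": ["Brain", "Lesion"],        # assumed concept extraction output
    "concept_ngrams": ["brain", "lesions"],      # assumed source spans of those concepts
    "pos_tags": ["NN", "IN", "VBZ", "NN", "NNS"],  # assumed tagger output for the context
}

features = build_feature_vector(sample, VOCABULARIES)
print(len(features), features)
```

Vectors built this way, one per labeled sentence, could then be passed to any SVM implementation (for example `sklearn.svm.SVC` in scikit-learn), with one classifier trained per abbreviation so that each model only chooses among that abbreviation's senses. Whether the study trained one model per abbreviation is an assumption of this sketch.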