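Taiwan has operated the National Health Insurance system since 1995, aiming to use financial pooling to spare people the predicament of being unable to pay for necessary medical services. Under this system, medical institutions must apply to the Bureau of National Health Insurance for reimbursement of patients' medical expenses according to the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). Institutions therefore employ specialists to assign codes based on patients' discharge records. This manual coding process is time-consuming and tedious, and automatic classification methods promise to help complete it efficiently. To improve automatic ICD-9-CM coding, this study investigated three well-known methods, Naïve Bayes, support vector machine (SVM) and vector space model (VSM), together with two feature selection methods, term frequency (TF) and TF multiplied by the inverse document frequency (TF-IDF), using electronic discharge records from six medical departments of a medical-center-level hospital in southern Taiwan. The study also examined how ontology-based synonym replacement affects coding accuracy. The experimental results show that SVM without feature selection performed best, while VSM combined with TF-IDF feature selection at a threshold of 0.1 was suitable only for the cardiovascular department. Although TF-IDF feature selection improved efficiency somewhat over TF feature selection, adding ontology-based synonym replacement did not improve coding prediction very effectively. In summary, SVM is recommended for automatic ICD-9-CM coding.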
In 1995, Taiwan's government initiated the National Health Insurance (NHI) program to marshal resources and relieve the difficulties that people may encounter when paying for health care. Under this program, most medical organizations apply to the Bureau of National Health Insurance for medical treatment fees according to diagnosis-related group (DRG) codes based on the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM). The application process requires specialists to assign ICD-9-CM codes from doctors' discharge diagnoses. Performed manually, this process is inefficient, time-consuming and tedious. Automatic classification methods can potentially reduce these problems. To improve the efficiency of ICD-9-CM predictions, we explored three well-known methods: Naïve Bayes, support vector machine (SVM) and vector space model (VSM), with term frequency (TF) and TF multiplied by the inverse document frequency (TF-IDF) each used as a weighting scheme for feature selection, on the discharge diagnoses of six hospital departments. This paper also explores whether the use of an ontology influences prediction accuracy. The experimental results show that the preferred method is SVM without feature weighting, although its performance varies across hospital departments: the mean macro-averaged F-measure score (F) is 0.7937, ranging from 0.7374 to 0.9009. Among the selected hospital departments, VSM with TF-IDF at a threshold of 0.1 was appropriate only for the cardiology department, while the models for the other departments were not modified. Regarding the use of an ontology, synonym replacement does not work very efficiently, although TF-IDF showed less improvement than TF. In summary, SVM is recommended for predicting ICD-9-CM codes.
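To make the VSM configuration concrete, the following is a minimal sketch of TF-IDF weighting with a 0.1 weight threshold and nearest-neighbour prediction by cosine similarity. It is an illustration under stated assumptions, not the implementation used in this study: the function names, the whitespace tokenizer, the length-normalized TF, the natural-log IDF, the way the threshold is applied, and the three toy diagnoses with their ICD-9-CM category codes are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lower-cased whitespace tokens stand in for real preprocessing.
    return text.lower().split()


def fit_idf(docs):
    # IDF per term: log(number of documents / documents containing the term).
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def vectorize(text, idf, threshold=0.1):
    # TF-IDF weights for one document; terms weighted below `threshold`
    # are dropped (an assumed reading of "threshold 0.1").
    tokens = tokenize(text)
    counts = Counter(tokens)
    vec = {}
    for term, count in counts.items():
        if term not in idf:
            continue  # terms unseen in training carry no weight
        weight = (count / len(tokens)) * idf[term]
        if weight >= threshold:
            vec[term] = weight
    return vec


def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def predict(text, train_vectors, train_codes, idf, threshold=0.1):
    # Return the code of the most similar training document,
    # or None when nothing overlaps with the query.
    query = vectorize(text, idf, threshold)
    sims = [cosine(query, vec) for vec in train_vectors]
    best = max(range(len(sims)), key=sims.__getitem__)
    return train_codes[best] if sims[best] > 0 else None


# Hypothetical toy data, for illustration only.
train_texts = [
    "acute myocardial infarction",
    "congestive heart failure",
    "chronic kidney disease",
]
train_codes = ["410", "428", "585"]

idf = fit_idf(train_texts)
train_vectors = [vectorize(text, idf) for text in train_texts]
print(predict("heart failure", train_vectors, train_codes, idf))
```

In this sketch, raising the threshold removes low-weight terms from every vector, so the threshold acts as a simple feature selection step before similarity is computed.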