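Cancer has headed the nation's list of the ten leading causes of death for many years. As medical technology advances, applying information technology and machine learning to the analysis of medical data, and of cancer data in particular, has become an important research topic. Data mining and machine learning methods can support the analysis of relevant risk factors and the construction of cancer recurrence models, providing information that benefits treatment and in turn improving the survival rate of cancer patients. Because cancer recurrence data suffer from the class imbalance problem, this study uses data mining techniques to address data imbalance, feature selection, and cancer recurrence prediction. Four methods for handling imbalanced data (ID) serve as sampling tools: over-sampling, under-sampling, cost-sensitive learning (CSL), and synthetic data generation (SDG). Four feature selection (FS) methods serve as variable selection techniques: the embedded method (LASSO), the wrapper method (Wrapper), the filter method (Filter), and stepwise elimination (VarSelRF). For disease prediction, this study uses decision tree (DT), random forest (RF), logistic regression (LR), and multivariate adaptive regression splines (MARS) as classifiers. Models built on different sets of disease risk factors are evaluated, and the classification performance of the prediction models is used to identify the important risk factors, so as to verify the accuracy of classification models based on the medical features and to determine the best classification result. The empirical results show that combining data balancing and feature selection yields better classification results than a single classifier alone, and that balancing the data first and then selecting features yields better results still. CSL-F-MARS and SDG-F-MARS give the best predictions, with classification measures higher than those of all other risk-factor combinations, which indicates the effectiveness of the proposed approach and of the selected risk factors. Using CSL-F-MARS or SDG-F-MARS to build a cancer recurrence model therefore yields a better model. Keywords: cancer recurrence, machine learning, feature selection, data imbalance, classification techniques.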
Cancer has headed the nation's list of the ten leading causes of death for many years. With the advance of medical technology, the application of information technology and machine learning to the analysis of medical data, especially cancer data, has become an important topic. Data mining and machine learning methods can assist in the analysis of relevant risk factors and in the construction of cancer recurrence models, providing information that benefits treatment and thereby improving the survival rate of cancer patients. Because cancer recurrence data exhibit a class imbalance problem, this study uses data mining techniques to deal with data imbalance, feature selection, and cancer recurrence prediction. Four methods for handling imbalanced data (ID) are used as sampling tools: over-sampling, under-sampling, Cost-Sensitive Learning (CSL), and Synthetic Data Generation (SDG). Four feature selection (FS) methods are used as variable selection techniques: the embedded method (LASSO), the wrapper method (Wrapper), the filter method (Filter), and stepwise elimination (VarSelRF). For disease prediction, this study uses Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), and Multivariate Adaptive Regression Splines (MARS) as classifiers. The models built on different sets of disease risk factors are evaluated, and their classification performance is used to identify the important risk factors, to verify the accuracy of classification models based on the medical features, and to determine the best classification result.
The results show that methods combining data balancing and feature selection provide better classification accuracy than a single model alone. Among the proposed methods, CSL-F-MARS and SDG-F-MARS are the best models: they produce the best classification results and are promising schemes for cancer recurrence prediction. Keywords: Cancer Recurrence, Machine Learning, Feature Selection, Data Imbalance, Classification Technique.
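The balance-then-select order described in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the toy data, the inverse-frequency class weights (standing in for a cost-sensitive scheme), and the standardized mean-difference score (standing in for a filter criterion) are all assumptions made for illustration, using only the Python standard library.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels


def class_weights(labels):
    """Assumed cost-sensitive heuristic: weight each class inversely to its frequency."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def filter_rank(rows, labels, positive=1):
    """Assumed filter score: absolute difference of class means over the feature's spread."""
    scores = []
    for j in range(len(rows[0])):
        col = [r[j] for r in rows]
        pos = [r[j] for r, y in zip(rows, labels) if y == positive]
        neg = [r[j] for r, y in zip(rows, labels) if y != positive]
        spread = pstdev(col) or 1.0  # guard against constant features
        scores.append((abs(mean(pos) - mean(neg)) / spread, j))
    # Feature indices, most discriminative first.
    return [j for _, j in sorted(scores, reverse=True)]


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [1.1, 6.0],
     [0.8, 5.0], [1.0, 3.0], [3.0, 4.0], [3.2, 5.0]]
y = [0, 0, 0, 0, 0, 0, 1, 1]  # 1 = recurrence (minority class)

X_bal, y_bal = random_oversample(X, y)   # step 1: balance the classes
ranking = filter_rank(X_bal, y_bal)      # step 2: rank features on balanced data
print(Counter(y_bal), ranking, class_weights(y))
```

The top-ranked features from `filter_rank` would then be passed to a classifier such as MARS; `class_weights` shows the alternative of leaving the data unchanged and penalizing minority-class errors more heavily instead.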