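Comparing the External Generalization Performance of Three Data Mining Algorithms in Predicting Five-Year Cervical Cancer Survival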

Predicting Cervical Cancer Survivability: A Comparison of Three Data Mining Methods

Abstract


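This study compares three data mining algorithms (artificial neural network, logistic regression, and decision tree), trained on samples from different diagnosis years, on their performance in predicting five-year survival of cervical cancer, and validates their external generalization. The study uses the cancer registry database (CIPUD, Cancer Incidence Public-use Database) from the Surveillance, Epidemiology, and End Results (SEER) program of the U.S. National Cancer Institute (NCI). From the years 1973 to 2000, 156,502 records with 72 variables were selected. After data cleaning, the 18 variables most relevant to predicting five-year cervical cancer survival were retained, together with 2,022 records with diagnosis years 1988-1996. The samples were divided by diagnosis year into 8 different groups of training and test samples and fed to the three algorithms (artificial neural network, decision tree, and logistic regression) to build models. AUC (area under the ROC curve) and accuracy were used to evaluate predictive ability and to identify model designs that yield good predictions. Results: in internal validation, the best-performing model was artificial neural network model 1, with an AUC of 0.9392 and an accuracy of 0.9474. In external validation, artificial neural network model 7 had the best AUC, 0.6455. In internal validation, the artificial neural network and the decision tree both outperformed logistic regression on AUC and accuracy. In external validation, the artificial neural network and logistic regression both outperformed the decision tree on AUC. Models built with the artificial neural network and logistic regression showed better external generalization, while models built with the artificial neural network and the decision tree showed better accuracy. To obtain better external validation results, the training sample can draw on data from the previous 2-3 years or more.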
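The year-based division into 8 training/test pairs might look like the sketch below. The abstract does not say how the pairs were formed, so the expanding-window scheme used here (train on 1988 through a cutoff year, test on the following year) is an assumption, as are the function name `temporal_splits` and the record fields `year` and `survived_5y`. The toy records are placeholders, not SEER data.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: diagnosis year plus a 0/1 five-year survival label.
Record = Dict[str, int]


def temporal_splits(
    records: List[Record], first_year: int = 1988, last_year: int = 1996
) -> List[Tuple[List[Record], List[Record]]]:
    """Return (train, test) pairs split by diagnosis year.

    Assumed scheme: for each cutoff year, train on first_year..cutoff and
    test on cutoff + 1. With 1988-1996 this gives 8 pairs.
    """
    splits = []
    for cutoff in range(first_year, last_year):
        train = [r for r in records if first_year <= r["year"] <= cutoff]
        test = [r for r in records if r["year"] == cutoff + 1]
        splits.append((train, test))
    return splits


# Toy data: one placeholder record per diagnosis year.
toy_records = [{"year": y, "survived_5y": y % 2} for y in range(1988, 1997)]
for train, test in temporal_splits(toy_records):
    train_years = sorted({r["year"] for r in train})
    test_years = sorted({r["year"] for r in test})
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_years[0]}")
```

Testing on a later diagnosis year than any seen in training is what makes the evaluation external: the model is scored on patients it could not have been fitted to.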

Keywords

None listed

Parallel Abstract


The purpose of this study was to compare the performance of an artificial neural network (ANN), a decision tree (C5), and logistic regression (LR) in predicting the 5-year survivability of cervical cancer, and to validate their external generalization. The data were drawn from the SEER (Surveillance, Epidemiology, and End Results) program of the NCI (National Cancer Institute) in the United States for the years 1973-2000, comprising 156,502 cases with 72 variables. After data cleaning, 2,022 cases diagnosed in 1988-1996 and 18 variables remained. The dataset was divided into 8 pairs of training and test sets according to the year in which the patients were diagnosed. The 8 training sets were fed to the three algorithms (ANN, C5, and LR) to build the models. Model performance in predicting the 5-year survivability of cervical cancer patients was measured by accuracy and AUC (area under the ROC curve). ANN achieved the best internal validation result on model 1 (AUC, 0.9392; accuracy, 0.9474) and the best external validation AUC (0.6455) on model 7. ANN and C5 outperformed LR in internal validation. ANN and LR both outperformed C5 on AUC in external validation. Overall, ANN and LR generalized better externally, while ANN and C5 classified more accurately.
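For readers unfamiliar with the two performance measures, here is a minimal sketch in plain Python. It computes accuracy as the share of correct class labels and AUC as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting one half. The function names, the 0.5 decision threshold, and the toy labels and scores are illustrative assumptions, not the study's data or tooling.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of cases whose predicted label equals the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def auc(y_true: List[int], scores: List[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    case has the higher score; ties contribute 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = survived five years, 0 = did not (made-up values).
y_true = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")  # 3 of 5 correct -> 0.6000
print(f"AUC      = {auc(y_true, scores):.4f}")       # 5 of 6 pairs ordered correctly -> 0.8333
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank the cases, which is one reason the two measures can favor different models.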
