Cervical cancer is among the most common cancers affecting women worldwide. Although its incidence and mortality have declined markedly year by year, it still has the highest incidence of any cancer among women in Taiwan, and its mortality ranks eighth among the ten leading cancers in Taiwan. This study applies four data mining techniques, namely the C5.0 decision tree, the CART decision tree, a support vector machine (SVM) trained with the SMO algorithm, and logistic regression, to build models of survival for cervical cancer patients and to extract effective rules about survival. Internal and external validation of the models was carried out with 10-fold cross-validation.

The data come from the National Health Insurance Research Database and cover 11,617 patients who applied for catastrophic illness status for cervical cancer between 1999 and 2002. In terms of both overall average accuracy and the highest accuracy of any individual model, the C5.0 decision tree (80.83%) predicted best, followed by logistic regression (80.47%), the CART decision tree (80.29%), and the SVM (79.93%). In terms of model stability, the SVM was more stable than C5.0, logistic regression, and CART.

The important factors identified by data mining include patient age; metastasis to bone, lung, or liver; non-attendance at Pap smear screening; and post-treatment complications such as uremia, peritonitis, and rectovaginal fistula. All of these are important prognostic factors for the survival of cervical cancer patients. The prognostic factors and rules found in this study can serve as a reference for the treatment and prevention of this disease.
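
The evaluation scheme described above, 10-fold cross-validation with mean accuracy as the measure of predictive ability, can be sketched in plain Python. This is only an illustration under stated assumptions: the data are synthetic, the two toy classifiers (a majority-class baseline and a one-split "age" stump) stand in for the study's C5.0, CART, SVM, and logistic regression models, and every function name is hypothetical. The abstract does not say how model stability was measured, so using the standard deviation of fold accuracies is also an assumption.

```python
import random
import statistics

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(fit, X, y, k=10, seed=0):
    """Return the test accuracy of each fold for a model built by fit(X_train, y_train)."""
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return scores

def fit_majority(X_train, y_train):
    """Baseline: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority

def fit_age_stump(X_train, y_train):
    """One-split decision stump on feature 0 (a hypothetical 'age' column)."""
    best = None
    for t in sorted({x[0] for x in X_train}):
        for below, above in ((0, 1), (1, 0)):
            hits = sum((below if x[0] <= t else above) == label
                       for x, label in zip(X_train, y_train))
            if best is None or hits > best[0]:
                best = (hits, t, below, above)
    _, t, below, above = best
    return lambda x: below if x[0] <= t else above

# Synthetic data (assumption): label 1 = survived, more likely under age 60,
# with 15% of labels flipped as noise.
rng = random.Random(42)
X, y = [], []
for _ in range(300):
    age = rng.randint(25, 85)
    label = 1 if age < 60 else 0
    if rng.random() < 0.15:
        label = 1 - label
    X.append((age,))
    y.append(label)

results = {}
for name, fit in (("majority baseline", fit_majority), ("age stump", fit_age_stump)):
    results[name] = cross_validate(fit, X, y, k=10)
    print(f"{name}: mean accuracy {statistics.mean(results[name]):.4f}, "
          f"fold std dev {statistics.stdev(results[name]):.4f}")
```

Each model is trained on nine folds and scored on the held-out tenth, so every record is used for testing exactly once. Comparing the mean of the ten fold accuracies gives a ranking of predictive ability, while a smaller spread across folds is one possible reading of a more stable model.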