In recent years, rapid social change has led people to favor convenient foods that are high in fat and low in fiber. Excessive intake of such foods irritates the colonic mucosa, and this irritation can readily lead to obstruction of the large intestine. According to statistics from the Department of Health (DOH), Executive Yuan, the number of people nationwide with colon and rectal cancer (collectively known as colorectal cancer) has risen year by year, from 2,642 in 1996 to 4,531 in 2009. The impact of colorectal cancer on the population can therefore no longer be ignored.

This study applies data mining techniques to general health examination data from a medical center in order to explore the association between health examination data and colorectal cancer, and to build a colorectal cancer prediction model. The model is constructed in two stages: (1) discriminant analysis and Logistic regression analysis are each applied to the health examination data to select the important risk factors for colorectal cancer; (2) the important risk factors obtained in stage (1) are used as independent variables, and artificial neural networks and support vector machines are each used to build a model that predicts the occurrence of colorectal cancer.

The results show that the model built with Logistic regression analysis combined with a support vector machine predicts colorectal cancer more accurately, with a mean accuracy of 88.60% and sensitivity and specificity of 87.32% and 75.76%, respectively. However, the model built with discriminant analysis combined with a support vector machine is less affected by the large imbalance between normal and abnormal cases in the sample; its mean accuracy is 77.45%, with sensitivity and specificity of 76.53% and 76.50%, respectively.

Keywords: general health examination, data mining, discriminant analysis, Logistic regression, artificial neural networks, support vector machines, colorectal cancer
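
The abstract compares models by accuracy, sensitivity, and specificity, and notes that the ratio of normal to abnormal cases affects the results. The following is a minimal sketch of how these three measures are computed and why accuracy alone can mislead when abnormal cases are rare. The class name `ConfusionCounts`, the helper functions, and the sample of 50 cancer cases and 950 normal cases are all hypothetical and chosen for illustration; they are not the study's data or code.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # cancer cases predicted as cancer
    fn: int  # cancer cases predicted as normal
    tn: int  # normal cases predicted as normal
    fp: int  # normal cases predicted as cancer


def count_outcomes(y_true, y_pred):
    """Tally the confusion matrix; label 1 = cancer, 0 = normal."""
    counts = ConfusionCounts(tp=0, fn=0, tn=0, fp=0)
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts.tp += 1
        elif truth == 1 and pred == 0:
            counts.fn += 1
        elif truth == 0 and pred == 0:
            counts.tn += 1
        else:
            counts.fp += 1
    return counts


def accuracy(c):
    """Share of all cases classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)


def sensitivity(c):
    """Share of cancer cases that the model detects."""
    return c.tp / (c.tp + c.fn)


def specificity(c):
    """Share of normal cases that the model correctly clears."""
    return c.tn / (c.tn + c.fp)


# Hypothetical, highly imbalanced sample: 50 cancer cases, 950 normal cases.
y_true = [1] * 50 + [0] * 950

# Model A labels every case as normal.
pred_a = [0] * 1000
# Model B detects 40 of 50 cancer cases and raises 190 false alarms.
pred_b = [1] * 40 + [0] * 10 + [0] * 760 + [1] * 190

counts_a = count_outcomes(y_true, pred_a)
counts_b = count_outcomes(y_true, pred_b)

for name, c in (("A", counts_a), ("B", counts_b)):
    print(name, f"{accuracy(c):.2f}", f"{sensitivity(c):.2f}", f"{specificity(c):.2f}")
```

In this made-up sample, model A has the higher accuracy even though it detects no cancer cases, while model B has lower accuracy but balanced sensitivity and specificity. This is one reason to report all three measures together when normal cases far outnumber abnormal ones.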