
以機器學習方法預測大腸直腸癌病患之第二原發惡性腫瘤

Machine Learning and Prediction of Second Primary Malignancy in Colorectal Cancer Survivors

Advisor: 張浤榮

Abstract


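Objective: Colorectal cancer ranks third worldwide in incidence and fourth in mortality among cancers. During treatment and follow-up, patients are often found to have developed a second cancer, termed a second primary malignancy (SPM). This complicates treatment and also affects patient survival. Because effective and varied medical and surgical treatments have prolonged the survival of colorectal cancer patients, cancer recurrence and SPM have become common clinical problems. Previous studies have confirmed that many risk factors are associated with recurrence, but SPM has not yet been discussed comprehensively. This study therefore aims to investigate the risk factors associated with the occurrence of SPM in colorectal cancer and their relative importance.

Methods and Materials: Five machine learning classifiers were selected: support vector machine (SVM), extreme learning machine (ELM), multivariate adaptive regression splines (MARS), random forest (RF), and eXtreme Gradient Boosting (XGBoost). Based on a literature review and the opinions of clinical experts, 14 database fields were selected as variables possibly related to SPM. The study design follows a two-stage sequential model and covers both the performance analysis of the five classifiers and the analysis of variable importance in the two-stage sequential model. Accuracy, sensitivity, specificity, the receiver operating characteristic (ROC) curve, and the area under the curve (AUC) were used to evaluate classifier performance.

Results: The data came from the cancer registry databases of three hospitals in Taiwan and comprised 4287 valid records. The risk factors, in descending order of importance, were combined stage, age at diagnosis, body mass index (BMI), surgical margin of the primary site, tumor size, sex, positive regional lymph nodes, tumor differentiation, primary tumor location, and alcohol consumption. For the single-stage models, built before variable selection, the highest accuracy was achieved by S-RF (0.819), the highest sensitivity by S-XGBoost (0.709), the highest specificity by S-RF (0.565), and the highest AUC by S-SVM (0.711). For the two-stage models, built after variable selection, the highest accuracy was achieved by A-MARS (0.731), the highest sensitivity by A-XGBoost (0.767), the highest specificity by A-MARS (0.772), and the highest AUC by A-XGBoost (0.714).

Conclusion and Suggestions: This study proposes a two-stage sequential machine learning model that identifies both the machine learning models and the risk factors that can accurately predict SPM in colorectal cancer patients. Combined stage is most strongly associated with SPM in colorectal cancer; other risk factors, including age at diagnosis, BMI, surgical margin of the primary site, tumor size, sex, positive regional lymph nodes, and primary site, are also relatively important. The results can help clinicians set an optimal surveillance schedule for the early detection of SPM, thereby improving prognosis and supporting effective treatment decisions. Finally, future studies could include patients' genetics, follow-up duration, comorbidities, and adjuvant treatment as parameters for a more complete investigation.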
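The evaluation measures named above (accuracy, sensitivity, specificity, AUC) can be illustrated with a minimal sketch. The labels and scores are made-up toy values, not study data; the function names and the 0.5 decision threshold are assumptions for illustration, since the abstract does not state how the measures were computed.

```python
from typing import Dict, List


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity from binary labels (1 = SPM, 0 = no SPM)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def auc(y_true: List[int], scores: List[float]) -> float:
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical values, not taken from the thesis).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = classification_metrics(y_true, y_pred)
auc_value = auc(y_true, scores)
print(metrics, auc_value)
```

Sensitivity and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the best model can differ from one measure to another.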

Parallel Abstract


Objective: Colorectal cancer (CRC) is the third most commonly occurring and fourth most deadly cancer in the world. CRC patients also have an increased risk of second primary malignancy (SPM) during the longer follow-up period. SPM not only complicates treatment but also affects the survival rate. Because varied medical and surgical treatments for CRC are highly effective, patients survive longer, and recurrence and SPM have become more common. Previous studies have focused on the risk factors for recurrence, but few have examined the risk factors for SPM. This study therefore aimed to rank the importance of the risk factors for SPM in CRC patients.

Methods and Materials: Support Vector Machine (SVM), Extreme Learning Machine (ELM), Multivariate Adaptive Regression Splines (MARS), Random Forest (RF), and eXtreme Gradient Boosting (XGBoost) classifiers were used in the proposed model for predicting SPM in CRC patients. Based on the literature and the comments of an expert committee, 14 predictor variables considered to be associated with SPM were selected. The research design follows a two-stage sequential model. In the single model, all 14 risk factors were used directly as predictors for SVM, RF, MARS, ELM, and XGBoost to construct five single classification models. In the two-stage model, the average rank of each risk factor was obtained first to identify the overall important risk factors, which were then used to construct the classification models. The performance of each classifier was evaluated using accuracy, sensitivity, specificity, the ROC (Receiver Operating Characteristic) curve, and AUC (Area Under Curve).

Results: The datasets, provided by three hospital tumor registries, covered a total of 4287 patients. Among the five single classification models, S-RF provided the best accuracy (0.819), S-XGBoost the best sensitivity (0.709), S-RF the best specificity (0.505), and S-SVM the best AUC (0.711). In the two-stage model, the 10 important risk factors, in descending order of weight, were combined stage, age at diagnosis, BMI, surgical margin, tumor size, gender, regional lymph node metastasis, tumor differentiation, tumor location, and alcohol consumption behavior. A-MARS provided the best accuracy (0.731), A-XGBoost the best sensitivity (0.767), A-MARS the best specificity (0.772), and A-XGBoost the best AUC (0.714).

Conclusion and Suggestion: This study proposed a two-stage sequential machine learning model to predict SPM in CRC patients and to identify its risk factors. Combined stage is the factor most relevant to SPM; other risk factors, including age at diagnosis, BMI, surgical margin of the primary site, tumor size, gender, number of positive regional lymph nodes, and primary tumor location, are also relatively important. The results can help clinicians detect SPM early, improve prognosis, and make effective medical decisions. Future studies should include genomic data, follow-up duration, comorbidities, and adjuvant treatment for a more comprehensive discussion.
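The variable-selection step of the two-stage model, in which risk factors are ordered by their average rank across classifiers, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the importance scores are invented, the model labels (`model_a` and so on) and function names are hypothetical, and the abstract does not say how each classifier's importance values were obtained or how ties were handled.

```python
from typing import Dict, List


def rank_descending(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank features within one model: 1 = most important.

    Simplification: tied scores receive consecutive ranks in input order.
    """
    ordered = sorted(scores, key=lambda name: scores[name], reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def average_ranks(importance_by_model: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average each feature's rank over all models."""
    per_model = [rank_descending(s) for s in importance_by_model.values()]
    features = list(per_model[0])
    return {f: sum(r[f] for r in per_model) / len(per_model) for f in features}


def select_top(avg_rank: Dict[str, float], k: int) -> List[str]:
    """Keep the k features with the lowest (best) average rank."""
    return sorted(avg_rank, key=lambda f: (avg_rank[f], f))[:k]


# Hypothetical importance scores (not study results) for four of the variables.
importance_by_model = {
    "model_a": {"combined_stage": 0.40, "age_at_diagnosis": 0.30, "bmi": 0.20, "alcohol_use": 0.10},
    "model_b": {"combined_stage": 0.35, "age_at_diagnosis": 0.25, "bmi": 0.30, "alcohol_use": 0.10},
    "model_c": {"combined_stage": 0.50, "age_at_diagnosis": 0.30, "bmi": 0.15, "alcohol_use": 0.05},
}

avg_rank = average_ranks(importance_by_model)
selected = select_top(avg_rank, k=3)
print(avg_rank)
print(selected)
```

In a second stage, only the columns named in `selected` would be passed to each classifier. Averaging ranks rather than raw importance values avoids comparing scores that different classifiers report on different scales.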

