A Study Applying Data Mining to Explore the Characteristics of Lymphoma Patients

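As the hardware cost of digitizing data falls and information technology software matures, the valuable knowledge held in digital data is being mined more deeply, bringing innovative activities and value to many industries. Cancer registration began in the 1920s and records information on cancer patients in a structured, standardized format. The records accumulated to date not only amount to a very large volume of data but also contain knowledge that medical science has not yet uncovered. This study uses a data set of 427,743 patients registered with lymphoma in the U.S. SEER program from 2004 to 2015 and applies three classification models: decision tree, random forest, and logistic regression. Each algorithm is run separately on the four types of lymphoma, on Hodgkin's and non-Hodgkin's lymphoma, and on the combination of all lymphomas, and classification accuracy is compared across the algorithms. The study also compares the importance of features selected by machine learning (16 features) with that of features selected by clinical experts and the literature (10 features). The results show that, for both feature sets and across the different lymphoma types and combinations, the decision tree and random forest achieve the highest accuracy of the three algorithms, with no clear difference between the two. Four of the machine-learning features were merged into two in the clinical expert and literature set; these features were important in both sets, indicating that the selection of highly important features does not change with the way a feature is represented. In addition, the more often a feature was selected by machine learning, the higher it ranked when the clinical expert and literature features were ordered by the decision tree.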

Keywords

Lymphoma; Data Mining; Algorithm

Parallel Abstract

In recent decades, falling hardware costs and advances in information technology have helped researchers find valuable knowledge in digital data, to the benefit of industry. Cancer registration began in the 1920s; the data are recorded in a structured, standardized form and include detailed information about patients. These records contain a large amount of information and valuable knowledge. This study therefore uses 427,743 lymphoma cases from SEER for 2004 to 2015 and applies three classification methods, (1) decision tree, (2) random forest, and (3) logistic regression, to predict classification for four types of lymphoma, for Hodgkin's and non-Hodgkin's lymphoma, and for all lymphomas combined, and then compares the accuracy of the three methods. The study also compares accuracy between two feature sets: (1) features selected by machine learning methods (16 features) and (2) features selected by clinical experts and from research papers (10 features). The results suggest that the decision tree and random forest reach equally high accuracy. Following clinical expert opinion, four of the machine-learning features were merged into two, and these features were also important in the clinical expert and research paper set. This implies that important features remain important regardless of the feature selection method. Features selected more frequently by machine learning also ranked higher in importance among the expert and research paper features under the decision tree method.

Parallel Keywords

Lymphoma; Data Mining; Machine Learning
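
The comparison described in the abstract (three classifiers evaluated on a 16-feature set and on a 10-feature set) can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the SEER records, and treats an arbitrary 10-column subset as a stand-in for the expert-selected features. The function name `compare_classifiers`, the 70/30 split, and all hyperparameters are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, seed=0):
    """Fit the three classifiers on one train/test split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    # Hypothetical settings; the source does not state the hyperparameters used.
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=seed),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in for the registry data: 16 features, 4 classes (assumed shape only).
X, y = make_classification(
    n_samples=2000,
    n_features=16,
    n_informative=8,
    n_redundant=2,
    n_classes=4,
    random_state=0,
)

scores_16 = compare_classifiers(X, y)          # all 16 features
scores_10 = compare_classifiers(X[:, :10], y)  # arbitrary 10-feature subset

for name in scores_16:
    print(f"{name}: 16 features = {scores_16[name]:.3f}, 10 features = {scores_10[name]:.3f}")
```

On real registry data the categorical fields would need encoding before fitting, and a single split gives only a rough accuracy estimate; cross-validation would be a more careful choice.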

