  • Thesis

Using Deep Learning to Predict Cancer Risk from Diagnosis and Medication History: The Case of Liver Cancer

Use Deep Learning to Accurately Predict Cancer Risk - A Case Study on Hepatocellular Carcinoma

Advisor: 李友專

Abstract


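Background: In Taiwan, National Health Insurance (NHI) coverage is high and the number of medical visits per person per year ranks among the highest in the world, so a person's health status is closely tied to their visit records. The NHI database also provides rich data, which lets us study how an individual's diagnosed diseases and prescribed drugs relate to liver cancer.
Methods: Using the outpatient and inpatient claims files of the NHI database from 1999 to 2013, we identified liver cancer patients (10,506 people) and sampled 40,000 non-liver-cancer cases. Two kinds of information served as features: diagnoses, represented by ICD-9-CM codes, and the use of chronic-disease drugs. Taking a three-year observation period as an example, on any day in the period when a person was diagnosed with a disease or prescribed a chronic-disease drug, that day was set to 1 in the person's two-dimensional feature-by-date table. Scores were summed over every 7 days or over the whole three years, then normalized to 0~1 against the scores of all people for the same period and the same disease. On one hand, convolutional (CNN) or multilayer perceptron (MLP) neural networks were trained and validated for liver cancer prediction accuracy; on the other, the importance of each variable was assessed by stepwise removal of variables while observing the loss in accuracy, by random forest, and by risk coefficients. To predict one year ahead, training requires the three years of data that precede the point one year before the cancer was recorded.
Results: For liver cancer prediction from 3 years of observation with lead times of 0.5 to 3 years, the AUROC was 0.883~0.880 with three-year summation and 0.917~0.906 with 7-day summation and a CNN. In the important-factor analysis for diseases, stepwise removal and random forest gave similar results; the top five by importance, from high to low, were: 1. chronic liver disease; 2. age; 3. hepatitis; 4. sex; 5. screening for malignant neoplasms. Screening for malignant neoplasms was negatively correlated, because liver cancer patients stop being counted after being screened once, whereas non-liver-cancer patients can be screened many times and are counted each time.
Conclusion: This study used not only the presence or absence of diseases and drugs but also timing information, which amounts to taking their recency and severity into account. It can also be seen as predicting cancer from the comorbidities hidden in an individual's medical records. Because it uses a large amount of data, the prediction model and important factors can be obtained without expert knowledge. Because the information it uses already exists in the database, it enables low-cost and rapid preliminary screening. Taiwan's NHI provides individuals' diagnostic results from the past three years, so the insurer can use the results of this study to predict liver cancer.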
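The feature construction described in the Methods can be sketched as follows. This is a minimal illustration, not the thesis code: the function names `weekly_counts` and `scale_to_unit` are hypothetical, the ICD-9-CM codes 571 and 070 are used only as sample inputs, and dividing by the per-cell maximum across patients is an assumed way to reach the 0~1 range, since the abstract does not name the normalization formula.

```python
from collections import defaultdict


def weekly_counts(events, n_days, bin_days=7):
    """Sum a patient's feature-by-day table into fixed-width bins.

    events: iterable of (feature, day_offset) pairs with day_offset in
    [0, n_days). A feature scores at most 1 per day, so duplicate pairs
    are collapsed before counting.
    """
    n_bins = -(-n_days // bin_days)  # ceiling division
    table = defaultdict(lambda: [0] * n_bins)
    for feature, day in set(events):
        if 0 <= day < n_days:
            table[feature][day // bin_days] += 1
    return dict(table)


def scale_to_unit(patients, features, n_bins):
    """Scale each (feature, bin) cell to [0, 1] across all patients.

    Assumption: each cell is divided by the largest value seen for the
    same feature and the same bin; cells that are zero for everyone stay 0.
    """
    zeros = [0] * n_bins
    scaled = [{f: [0.0] * n_bins for f in features} for _ in patients]
    for f in features:
        for b in range(n_bins):
            column = [p.get(f, zeros)[b] for p in patients]
            peak = max(column)
            if peak > 0:
                for i, value in enumerate(column):
                    scaled[i][f][b] = value / peak
    return scaled


# Two toy patients observed for 21 days (three 7-day bins).
patient_a = weekly_counts([("571", 0), ("571", 3), ("571", 8), ("070", 20)], n_days=21)
patient_b = weekly_counts([("571", 1)], n_days=21)
matrix = scale_to_unit([patient_a, patient_b], ["571", "070"], n_bins=3)
```

Each patient's scaled table is a features-by-bins grid, which is the kind of input a CNN can convolve along the time axis; summing over the whole observation period instead corresponds to setting `bin_days` equal to `n_days`.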

Keywords

cancer; prediction; liver cancer; deep learning; machine learning; cancer incidence; risk; history; diagnosis

Parallel Abstract


Background: The National Health Insurance (NHI) agency covers over 99% of the people in Taiwan, which makes its Research Database (NHIRD) a rich data source for predicting hepatocellular carcinoma (HCC) risk and discovering additional risk factors for HCC.
Methods: Using clinical data collected between 1999 and 2013 from 2 million randomly sampled people in Taiwan, we found 10,506 HCC patients and randomly sampled 40,000 non-HCC patients to act as the control group. Patients' ICD-9-CM diagnostic codes and medication history (long-term drugs) were used to represent their clinical state. We used one-hot encoding to indicate the presence or absence of an ICD-9 code or medication code prior to the HCC diagnosis. For example, if a patient had three years' worth of clinical data prior to being diagnosed with HCC (the "index date"), each day within those three years on which the patient received a given ICD-9 code was recorded as a 1 in the feature-by-day matrix. After summing the scores over each 7-day period, we normalized them against all patients in the same period and used a convolutional neural network (CNN) to predict the risk of HCC. To predict N years ahead of time, the observation data must end N years before the index date. We used three methods to identify the important features: odds ratio, random forest, and observation of AUROC loss under stepwise selection.
Results: With 3 years' worth of observation data and lead times of 0.5, 1, 2, and 3 years before the HCC index date, the AUROC of HCC prediction ranged from 0.917 to 0.906. The most important features for HCC identified by random forest and stepwise ANN were similar. The top five were: 1. Chronic liver disease; 2. Age; 3. Screening for malignant neoplasms; 4. Gender; 5. Viral hepatitis. Among these, "Screening for malignant neoplasms" was negatively correlated with HCC because HCC patients stopped being counted after HCC was diagnosed, while non-HCC patients might have continued to be screened.
Conclusion: The value of this study lies in the finding that deep learning methods, especially CNNs that incorporate time-series information, have the potential to predict HCC more accurately using standardized and widely available clinical data across a broad range of patients. The study also identified some important risk factors that are highly correlated with HCC. Since NHI provides the latest 3 years of clinical claims data to interested parties, the approach in this study can be applied to predict HCC risk in clinical applications with minimal difficulty.
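The lead-time rule in the Methods (to predict N years ahead, the observation data must end N years before the index date) can be illustrated with a short date calculation. This is a sketch under stated assumptions: the helper names `observation_window` and `events_in_window` are hypothetical, a year is approximated as 365 days, and the window is treated as half-open, none of which is specified in the abstract.

```python
from datetime import date, timedelta


def observation_window(index_date, lead_years, obs_years=3, days_per_year=365):
    """Return (start, end) of the observation window.

    The window ends lead_years before the index date and spans obs_years.
    Assumption: years are approximated as 365 days each.
    """
    end = index_date - timedelta(days=round(lead_years * days_per_year))
    start = end - timedelta(days=obs_years * days_per_year)
    return start, end


def events_in_window(events, start, end):
    """Keep (code, event_date) pairs with start <= event_date < end."""
    return [(code, d) for code, d in events if start <= d < end]


# Predicting one year ahead of a hypothetical index date of 2010-01-01.
start, end = observation_window(date(2010, 1, 1), lead_years=1)
records = [
    ("571", date(2007, 5, 1)),    # inside the window, kept
    ("571", date(2009, 6, 1)),    # inside the lead-time gap, dropped
    ("070", date(2005, 12, 31)),  # before the window, dropped
]
usable = events_in_window(records, start, end)
```

Dropping events that fall inside the lead-time gap keeps the model from seeing information that would not yet exist at prediction time.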

Parallel Keywords

Cancer; predict; liver; HCC; deep learning; CNN; machine learning; diagnosis; history
