使用大規模數據集對脂肪肝疾病的當前訪問和下次訪問預測:模型開發和性能比較

Current-Visit and Next-Visit Prediction for Fatty Liver Disease With a Large-Scale Dataset: Model Development and Performance Comparison

Advisor: 張智星

Abstract


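Fatty liver disease (FLD) is caused by the accumulation of fat in the liver and may lead to liver inflammation, which, if not well controlled, may progress to liver fibrosis, cirrhosis, or even hepatocellular carcinoma. Based on a large-scale, multiyear dataset from a health screening center, this paper proposes two FLD prediction tasks: current-visit prediction (CVP) and next-visit prediction (NVP). CVP predicts the likelihood of FLD from the laboratory test and questionnaire information obtained at the current visit, whereas NVP predicts the likelihood of FLD at the next visit from the trajectory of laboratory tests and the questionnaire information of all past visits. In practice, NVP is more valuable in preventive medicine because, if the prediction is positive, the physician can recommend effective lifestyle changes to the patient to prevent FLD from occurring by the next visit. To the best of our knowledge, this is the first attempt to apply machine learning to NVP on a large-scale health screening dataset. In addition, we performed feature selection based on CVP/NVP to obtain results consistent with the features manually selected by physicians. Such multitask prediction can offer better and more valuable advice to both patients and physicians in the practice of preventive medicine.

We describe the construction of machine learning models for CVP, which can help physicians obtain more information for accurate diagnosis, and for NVP, which can help physicians give potential high-risk patients advice on effectively preventing FLD. The large-scale, high-dimensional dataset used in this study comes from the MJ Health Research Foundation in Taipei, Taiwan. We used one-pass ranking and sequential forward selection (SFS) for feature selection in FLD prediction. For CVP, we explored multiple models, including the k-nearest-neighbor classifier (KNNC), AdaBoost, support vector machine (SVM), logistic regression (LR), random forest (RF), Gaussian naive Bayes (GNB), C4.5 decision trees (C4.5), and classification and regression trees (CART). For NVP, we used long short-term memory (LSTM) and several of its variants as sequence classifiers that use various input sets for prediction. Model performance was evaluated on two criteria: the accuracy on the test set and the intersection over union/coverage between the features selected by one-pass ranking/SFS and those selected by domain experts. Accuracy, precision, recall, F1 measure, and area under the receiver operating characteristic curve were computed for CVP and NVP separately for males and females.

After data cleaning, the dataset included 34,856 and 31,394 unique visits for males and females, respectively, for the period 2009-2016. The test accuracy of CVP using KNNC, AdaBoost, SVM, LR, RF, GNB, C4.5, and CART was 84.28%, 83.84%, 82.22%, 82.21%, 76.03%, 75.78%, and 75.53%, respectively. The test accuracy of NVP using LSTM, bidirectional LSTM (biLSTM), Stack-LSTM, Stack-biLSTM, and Attention-LSTM was 76.54%, 76.66%, 77.23%, 76.84%, and 77.31%, respectively, for fixed-interval features, and 79.29%, 79.12%, 79.32%, 79.29%, and 78.36%, respectively, for variable-interval features.

This study explored a large-scale, high-dimensional FLD dataset. We developed FLD prediction models for CVP and NVP. We also implemented efficient feature selection schemes for current- and next-visit prediction to compare the automatically selected features with expert-selected features. In particular, NVP appears more valuable from the viewpoint of preventive medicine. For NVP, we recommend feature set 2 (with variable intervals), which is more compact and flexible. We also tested several LSTM variants in combination with the two feature sets to identify the best match for male and female FLD prediction. More specifically, the best model for males was Stack-LSTM using feature set 2 (with 79.32% accuracy), whereas the best model for females was LSTM using feature set 1 (with 81.90% accuracy).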

Parallel Abstract


Fatty liver disease (FLD) arises from the accumulation of fat in the liver and may cause liver inflammation, which, if not well controlled, may develop into liver fibrosis, cirrhosis, or even hepatocellular carcinoma. Based on a large-scale dataset from a health screening clinic, this paper proposes two prediction tasks for FLD: current-visit prediction (CVP) and next-visit prediction (NVP). CVP predicts the likelihood of FLD from the laboratory test and questionnaire information obtained at the current visit, whereas NVP predicts the likelihood of FLD at the next visit from the trajectory of laboratory test and questionnaire information across all past visits. In practice, NVP is much more valuable in preventive medicine: if the prediction is positive, the physician can suggest effective lifestyle changes to the patient to prevent the occurrence of FLD at the next visit. As far as we know, this is the first attempt at machine learning for NVP based on a large-scale health screening dataset. Moreover, we performed feature selection based on CVP/NVP to achieve results consistent with the features manually selected by physicians. Such multitask prediction can give better and more valuable suggestions to both the patient and the physician in the practice of preventive medicine.

We describe the construction of machine learning models for CVP, which can help physicians obtain more information for accurate diagnosis, and for NVP, which can help physicians advise potential high-risk patients on how to effectively prevent FLD. The large-scale, high-dimensional dataset used in this study comes from the MJ Health Research Foundation in Taipei, Taiwan. We used one-pass ranking and sequential forward selection (SFS) for feature selection in FLD prediction.
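The two feature selection schemes named above can be illustrated with a minimal sketch. It assumes a wrapper-style criterion (leave-one-out accuracy of a 1-nearest-neighbor classifier); the abstract does not state which classifier or criterion drove the selection, so `loo_1nn_accuracy`, the function names, and the toy data are all hypothetical and are not taken from the MJ dataset.

```python
from typing import List, Sequence, Tuple


def loo_1nn_accuracy(X: Sequence[Sequence[float]], y: Sequence[int],
                     features: List[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `features`.

    Assumed scoring criterion for this sketch only.
    """
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for j in range(len(X)):
            if i == j:
                continue  # leave sample i out of its own neighbor search
            dist = sum((X[i][f] - X[j][f]) ** 2 for f in features)
            if dist < best_dist:
                best_dist, best_label = dist, y[j]
        correct += int(best_label == y[i])
    return correct / len(X)


def one_pass_ranking(X, y, score=loo_1nn_accuracy) -> List[int]:
    """Rank features by the score each achieves on its own (best first)."""
    return sorted(range(len(X[0])), key=lambda f: (-score(X, y, [f]), f))


def sequential_forward_selection(X, y, max_features: int,
                                 score=loo_1nn_accuracy) -> Tuple[List[int], float]:
    """Greedily add the single feature that most improves the score."""
    selected: List[int] = []
    best_score = float("-inf")
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < max_features:
        # Evaluate every subset formed by adding one remaining feature.
        top_score, top_feature = max(
            ((score(X, y, selected + [f]), f) for f in remaining),
            key=lambda pair: (pair[0], -pair[1]),  # ties go to the lower index
        )
        if top_score <= best_score:
            break  # no candidate improves the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score


# Hypothetical toy data: only feature 1 separates the two classes.
X_toy = [
    [0.9, 0.1, 0.3],
    [0.1, 0.2, 0.8],
    [0.5, 0.3, 0.1],
    [0.2, 0.8, 0.9],
    [0.8, 0.9, 0.2],
    [0.4, 1.0, 0.7],
]
y_toy = [0, 0, 0, 1, 1, 1]

ranking = one_pass_ranking(X_toy, y_toy)
selected, best = sequential_forward_selection(X_toy, y_toy, max_features=2)
print(ranking, selected, best)
```

One-pass ranking scores each feature once in isolation, so it is cheap but ignores interactions between features. SFS re-scores candidate subsets at every step, which costs more evaluations but can stop as soon as an added feature no longer helps.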
For CVP, we explored multiple models, including the k-nearest-neighbor classifier (KNNC), AdaBoost, support vector machine (SVM), logistic regression (LR), random forest (RF), Gaussian naive Bayes (GNB), C4.5 decision trees (C4.5), and classification and regression trees (CART). For NVP, we used long short-term memory (LSTM) and several of its variants as sequence classifiers that use various input sets for prediction. Model performance was evaluated on two criteria: the accuracy on the test set and the intersection over union/coverage between the features selected by one-pass ranking/SFS and those selected by domain experts. The accuracy, precision, recall, F-measure, and area under the receiver operating characteristic curve were calculated for both CVP and NVP, separately for males and females.

After data cleaning, the dataset included 34,856 and 31,394 unique visits for males and females, respectively, for the period 2009-2016. The test accuracy of CVP using KNNC, AdaBoost, SVM, LR, RF, GNB, C4.5, and CART was 84.28%, 83.84%, 82.22%, 82.21%, 76.03%, 75.78%, and 75.53%, respectively. The test accuracy of NVP using LSTM, bidirectional LSTM (biLSTM), Stack-LSTM, Stack-biLSTM, and Attention-LSTM was 76.54%, 76.66%, 77.23%, 76.84%, and 77.31%, respectively, for fixed-interval features, and 79.29%, 79.12%, 79.32%, 79.29%, and 78.36%, respectively, for variable-interval features.

This study explored a large-scale FLD dataset with high dimensionality. We developed FLD prediction models for CVP and NVP. We also implemented efficient feature selection schemes for current- and next-visit prediction to compare the automatically selected features with expert-selected features. In particular, NVP emerged as more valuable from the viewpoint of preventive medicine. For NVP, we propose the use of feature set 2 (with variable intervals), which is more compact and flexible.
We also tested several LSTM variants in combination with the two feature sets to identify the best match for male and female FLD prediction. More specifically, the best model for males was Stack-LSTM using feature set 2 (with 79.32% accuracy), whereas the best model for females was LSTM using feature set 1 (with 81.90% accuracy).
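The second evaluation criterion, the agreement between automatically selected and expert-selected features, can be sketched as follows. Intersection over union is the standard set ratio; the abstract does not define coverage, so it is assumed here to mean the fraction of expert-selected features that the automatic method also picked. The feature names are hypothetical placeholders, not the features reported in the thesis.

```python
from typing import Iterable


def intersection_over_union(auto_selected: Iterable[str],
                            expert_selected: Iterable[str]) -> float:
    """|A ∩ E| / |A ∪ E| for the two feature sets."""
    a, e = set(auto_selected), set(expert_selected)
    union = a | e
    return len(a & e) / len(union) if union else 0.0


def coverage(auto_selected: Iterable[str],
             expert_selected: Iterable[str]) -> float:
    """Assumed definition: share of expert features also chosen automatically."""
    a, e = set(auto_selected), set(expert_selected)
    return len(a & e) / len(e) if e else 0.0


# Hypothetical feature names, for illustration only.
auto_features = ["BMI", "ALT", "triglycerides", "waist"]
expert_features = ["BMI", "ALT", "age"]

iou = intersection_over_union(auto_features, expert_features)
cov = coverage(auto_features, expert_features)
print(f"IoU = {iou:.2f}, coverage = {cov:.2f}")  # IoU = 0.40, coverage = 0.67
```

Intersection over union penalizes extra automatic features that the experts did not choose, whereas coverage under this assumed definition only asks how many expert choices were recovered, so the two numbers can diverge when the automatic set is much larger.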
