Knowledge Distillation for Disease Prediction with Low Dimensional Input

Advisor: 周承復

Abstract


The elderly population in Taiwan is growing year by year. People aged 65 and over are more susceptible to chronic diseases than younger adults, and dementia is one of the high-risk chronic diseases among the elderly. Because dementia burdens both families and society, it is an issue that must be taken seriously.

A disease prediction model with excellent performance has previously been proposed. It adopts a multi-module architecture that splits the input into two parts: a personal profile and medical records, where each record contains the time of the visit and the diagnosed disease codes. The personal profile and the medical records are fed into separate modules to extract features. Because the records contain tens of thousands of distinct disease codes, they first pass through a Word2Vec embedding layer, which maps the high-dimensional codes into a space where highly correlated diseases cluster together; an attention layer then captures the correlations between diseases and the temporal features of the visit sequence. The resulting model performs very well.

However, this model requires ten years of medical records. Applying for the Ministry of Health and Welfare dataset for training is feasible, but in practice, once the model is deployed it is very difficult to obtain ten years of medical records for an individual. To address this data-availability problem, we reprocess the dataset from a personal profile plus 300 medical records into a personal profile plus binary indicators of whether the patient has had each of 139 specific diseases, and we use the already-trained disease prediction model to transfer the knowledge it learned from the record-based dataset to a model that uses the 139-disease dataset.

Because the two datasets differ in dimensionality but share the same distribution, they do not fit the assumptions of modern transfer learning research. We therefore use knowledge distillation, a model compression technique: the teacher model uses the dataset of 300 medical records, and the student model uses the 139-disease dataset. Our proposed distillation method selects the most critical module of the teacher and of the student and matches their outputs. In the first stage, the student learns to reproduce the teacher's intermediate-layer output; in the second stage, the student loads the parameters learned in the first stage and trains against the labels. In the end, we successfully transfer the disease correlations and importance learned by the teacher to the student, so that the student outperforms the same model trained directly on the 139-disease dataset.

Beyond presenting the model's results, we also discuss ways the disease prediction model might be improved in the future. First, the teacher can capture both the degree of association between diseases and temporal features, while the current student has no time-series input; we could reprocess the student's dataset to record how long the patient has had each of the 139 diseases, then apply our proposed distillation method so that the student learns from the teacher how to capture temporal features. Second, we could apply mutual learning, letting the teacher and the student take turns learning from each other's perspectives, so that not only does the student benefit from the teacher, but the teacher also has the opportunity to further improve its performance.
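As a concrete illustration of the record-encoding path described above (a disease-code embedding layer followed by attention), here is a minimal PyTorch sketch. The layer sizes, the use of nn.MultiheadAttention, and the mean pooling are illustrative assumptions, not the thesis' actual architecture; in the thesis the embedding is Word2Vec-pretrained so that correlated codes cluster.

```python
import torch
import torch.nn as nn

class RecordEncoder(nn.Module):
    """Encodes a sequence of up to 300 visit records (disease codes)."""
    def __init__(self, n_codes=10000, d_embed=128, n_heads=4):
        super().__init__()
        # Stand-in for the Word2Vec embedding layer described in the abstract.
        self.embed = nn.Embedding(n_codes, d_embed, padding_idx=0)
        # Attention over the visit sequence, capturing disease correlations
        # and temporal patterns across the record window.
        self.attn = nn.MultiheadAttention(d_embed, n_heads, batch_first=True)

    def forward(self, codes):              # codes: (batch, 300) integer ids
        x = self.embed(codes)              # (batch, 300, d_embed)
        out, _ = self.attn(x, x, x)        # self-attention over visits
        return out.mean(dim=1)             # pooled record representation

encoder = RecordEncoder()
records = torch.randint(1, 10000, (2, 300))  # two patients, 300 records each
print(encoder(records).shape)                # torch.Size([2, 128])
```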

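The dataset reprocessing step, collapsing each patient's visit records into binary indicators for 139 diseases, could look like the following sketch; the `code_to_idx` mapping and the exact disease grouping are hypothetical, as the thesis' grouping is not reproduced here.

```python
import numpy as np

# code_to_idx maps each of the 139 target disease codes to a vector index
# (assumed mapping; the thesis' actual disease grouping is not shown here).
def to_139_flags(visit_codes, code_to_idx):
    flags = np.zeros(139, dtype=np.float32)
    for code in visit_codes:
        idx = code_to_idx.get(code)
        if idx is not None:        # ignore codes outside the 139 targets
            flags[idx] = 1.0       # "has been diagnosed with disease idx"
    return flags
```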
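The two-stage distillation procedure can be sketched as follows, again in PyTorch. The `intermediate` accessor, the data-loader tuple layout, the MSE matching loss, and the optimizers and learning rates are all illustrative assumptions rather than the thesis' exact formulation.

```python
import torch
import torch.nn as nn

def stage1_distill(teacher, student, loader, epochs=10):
    """Stage 1: the student mimics the teacher's intermediate-layer output."""
    teacher.eval()                         # the teacher is trained and frozen
    opt = torch.optim.Adam(student.parameters(), lr=1e-3)
    mse = nn.MSELoss()
    for _ in range(epochs):
        for records, flags139, _ in loader:   # teacher vs. student inputs
            with torch.no_grad():
                t_mid = teacher.intermediate(records)   # key module's output
            s_mid = student.intermediate(flags139)      # hypothetical accessor
            loss = mse(s_mid, t_mid)
            opt.zero_grad(); loss.backward(); opt.step()
    torch.save(student.state_dict(), "stage1.pt")

def stage2_finetune(student, loader, epochs=10):
    """Stage 2: reload the stage-1 parameters, then learn from the labels."""
    student.load_state_dict(torch.load("stage1.pt"))
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for _, flags139, label in loader:  # label: disease outcome (float)
            loss = bce(student(flags139), label)
            opt.zero_grad(); loss.backward(); opt.step()
```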

