With advances in medical care in Taiwan, the country's demographic structure has changed drastically. Taiwan is now an aged society, and health management for the elderly population has become an important issue. Chronic diseases have a significant impact on patients' quality of life and long-term health, so predicting chronic disease risk is of great significance. Diabetes mellitus (DM), heart disease, stroke, and hypertension are among the most common chronic diseases in Taiwan's elderly population.

Chronic disease inevitably places a financial burden on patients, and insurance companies offer a variety of insurance products in response. However, individuals must invest considerable effort in finding the products that suit them, while insurers must devote manpower to assessing the health risks of applicants. In light of this, this paper uses deep learning techniques to predict, from patients' personal information and medical records, their risk of developing diabetes mellitus, heart disease, stroke, and hypertension. Insurers can use the disease risk levels that the model predicts for different regions to perform a simple customer segmentation, which speeds up and refines the underwriting process, and sales representatives can recommend suitable insurance products to each customer group, achieving a win-win outcome.

In this study, we use a randomly sampled dataset provided by the Health and Welfare Data Science Center (HWDC), which contains the medical records of 2 million individuals, as training data for our disease prediction model. The dataset covers not only personal information (e.g., age and gender) but also patients' medical records, which contain many of the risk factors that the medical literature associates with diabetes mellitus, heart disease, stroke, and hypertension.

Using the concept of multi-task learning, this paper extends click-through rate (CTR) prediction models and multi-modal network models, which were originally suited only to predicting a single disease, into multi-task learning models that predict multiple diseases simultaneously. This approach greatly reduces the number of model parameters and saves training time while preserving predictive ability, and the multi-task models can even outperform models trained with single-task learning. These results help corroborate the direct or indirect associations among diabetes mellitus, heart disease, stroke, and hypertension, in line with the medical literature.

Beyond the reduction in model parameters and training time, this study also explores how well the attention scores of the self-attention mechanism explain the relationships among diseases in the medical records, in order to identify high-risk diseases or related comorbidities that strongly influence the risk level predicted by the model. We further analyze the effect of personal information, such as age and gender, on model performance. The final experimental results agree with the risk factors reported in the medical literature.
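
The parameter saving claimed for multi-task learning can be illustrated with a small sketch. The abstract does not state which sharing scheme is used, so the code below assumes hard parameter sharing: one shared hidden layer feeding four disease-specific sigmoid heads. The layer sizes, the function names, and the random input are all hypothetical, and the CTR, multi-modal, and self-attention components of the actual models are not reproduced here.

```python
import math
import random

# The four diseases predicted jointly (task names are illustrative labels).
TASKS = ["diabetes", "heart_disease", "stroke", "hypertension"]


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(layer, x, activation):
    """Apply a dense layer followed by an element-wise activation."""
    weights, biases = layer
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def count_params(layer):
    """Number of trainable values (weights plus biases) in a dense layer."""
    weights, biases = layer
    return sum(len(row) for row in weights) + len(biases)


def build_multitask(n_features, n_hidden, rng):
    """Assumed architecture: one shared layer plus one output head per disease."""
    shared = init_layer(n_features, n_hidden, rng)
    heads = {task: init_layer(n_hidden, 1, rng) for task in TASKS}
    return shared, heads


def predict(model, x):
    """Compute the shared representation once, then one risk score per disease."""
    shared, heads = model
    hidden = dense(shared, x, relu)
    return {task: dense(head, hidden, sigmoid)[0] for task, head in heads.items()}


rng = random.Random(0)
N_FEATURES, N_HIDDEN = 16, 8  # hypothetical sizes, not taken from the study
model = build_multitask(N_FEATURES, N_HIDDEN, rng)
x = [rng.random() for _ in range(N_FEATURES)]  # stand-in for one patient's features
risks = predict(model, x)

shared, heads = model
multi_task_params = count_params(shared) + sum(count_params(h) for h in heads.values())
# Four independent single-task models would each need their own hidden layer.
single_task_params = len(TASKS) * (count_params(shared) + count_params(heads[TASKS[0]]))

print(risks)
print(multi_task_params, single_task_params)
```

With these toy sizes the shared model has 172 parameters, against 580 for four separate single-task models of the same shape. The untrained risk scores are meaningless and only show that one forward pass yields all four predictions.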