
An Effective Disease Risk Prediction System Using Big Data Analytics

Advisor: 王國禎

Abstract


Big data analytics is the process of examining large data sets containing a variety of data types to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful business information. The healthcare industry is data rich, and predicting a new patient's disease risk from patients' historical data is an important research issue. Related studies have used data mining techniques, such as decision trees, clustering, and association rules, and recommender systems, such as user-based collaborative filtering (CF), to predict the future disease risk of new patients. However, with a decision tree, a small change in the input data can produce a large change in the tree, which gives poor accuracy on large data sets. Clustering algorithms such as k-means require the number of clusters to be known in advance and do not work well when clusters differ in size or density, so it is difficult to use them to predict future disease risk for large data sets. Association rules require that the data items be related to one another, so they may not be applicable to all data sets. User-based CF, such as CARE, yields poor accuracy when patients' diseases vary widely. A representative related work, CFIAC, is based on item-based CF but cannot deal with the sparsity problem because it does not use any pre-processing method to remove data that contribute little to prediction. In this thesis, we propose an effective disease risk prediction system (EDRP) that combines distribution-based clustering with item-based CF. The system is feasible for large data sets and performs well at capturing future disease risk for new patients. Experimental results show that, compared to CARE and CFIAC, the proposed EDRP increases coverage by 15.32% and 24.08% and accuracy by 19.56% and 32.76%, respectively.
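To make the item-based CF idea concrete, the following is a minimal sketch and not the thesis's implementation. It assumes each patient record is a set of diagnosed diseases, scores disease pairs by cosine similarity of co-occurrence, and ranks unseen diseases for a new patient. The function names, the toy records, and the choice of cosine similarity are all illustrative assumptions; the distribution-based clustering step of EDRP is not shown because the abstract does not describe it.

```python
from math import sqrt


def item_similarity(records, a, b):
    """Cosine similarity between diseases a and b over binary patient records."""
    both = sum(1 for r in records if a in r and b in r)
    count_a = sum(1 for r in records if a in r)
    count_b = sum(1 for r in records if b in r)
    if count_a == 0 or count_b == 0:
        return 0.0
    return both / sqrt(count_a * count_b)


def predict_risk(records, history, top_n=3):
    """Rank diseases absent from `history` by summed similarity to it."""
    diseases = set().union(*records)
    scores = {}
    for candidate in diseases - set(history):
        score = sum(item_similarity(records, candidate, known) for known in history)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical toy data: one set of diagnoses per past patient.
records = [
    {"diabetes", "hypertension", "kidney disease"},
    {"diabetes", "hypertension"},
    {"hypertension", "asthma"},
    {"asthma", "allergy"},
    {"diabetes", "kidney disease"},
]
ranking = predict_risk(records, {"diabetes"})
for disease, score in ranking:
    print(f"{disease}: {score:.3f}")
```

In this toy data, a new patient with diabetes is ranked against diseases that co-occur with it in past records; diseases that never co-occur with the patient's history receive no score and are omitted, which is one simple way sparsity limits coverage.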

References


[2] Ji, Xiang, et al. "Collaborative and trajectory prediction models of medical conditions by mining patients' social data." 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2015.
[3] Vucetic, Slobodan, and Zoran Obradovic. "Collaborative filtering using a regression-based approach." Knowledge and Information Systems 7.1 (2005): 1-22.
[5] Davis, Darcy A., et al. "Time to CARE: a collaborative engine for practical disease prediction." Data Mining and Knowledge Discovery 20.3 (2010): 388-415.
[6] Jothi, S., and S. Anita. "Data mining classification techniques applied for cancer disease: a case study using XLMiner." International Journal of Engineering Research and Technology. Vol. 1. No. 8 (October 2012). ESRSA Publications, 2012.
[7] Palaniappan, Sellappan, and Rafiah Awang. "Intelligent heart disease prediction system using data mining techniques." 2008 IEEE/ACS International Conference on Computer Systems and Applications. IEEE, 2008.
