Automated Knowledge Mining of Medical Big Data

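The main purpose of this study is to give researchers a tool that automatically mines knowledge from a medical big-data database before they begin their own research on it, supported by visualization tools that help them quickly understand the knowledge the database contains. Studies that use medical big data typically address a single topic and use only a small portion of the data, so they do not exploit the advantages of big data. Taking the case-control study as an example, this work applies the statistical methods used in such studies, such as the log-rank test, to analyze large volumes of medical visit records automatically. Once the automated analysis is complete, the results are presented on a web page through several visualization tools so that researchers can browse and inspect the knowledge in the data. Using the relationship between hepatitis C and interferon as an example, the study mined, for nearly 1,300 patients who received the drug and 14,000 who did not, the case-control results for every disease later diagnosed in these patients, together with the corresponding Kaplan-Meier curves and log-rank test p-values. Visualization tools such as Sankey diagrams and treemaps provide a complete knowledge-browsing system that gives researchers a deeper understanding of the knowledge in this dataset before they begin their research. Because historical medical big data may require extensive preprocessing, distributed computing plays an important role in mining it. Future work includes applying other medical statistical methods and data mining algorithms to the automated analysis system, and building a knowledge database that consolidates the results for secondary analysis and other uses.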

Keywords

Medical big data; medical knowledge mining; case-control study; data visualization; medical data mining

Parallel Abstract

The purpose of this research is to provide an automated knowledge discovery tool that lets researchers mine knowledge from a medical big dataset before they use it in their own studies, together with a visualization interface through which they can easily browse the features of the dataset and the knowledge mined from it. When researchers use medical big data to look for medical knowledge, they usually focus on a narrow topic and work with a small subset of the data, and so do not take advantage of big data analysis. This research uses the case-control study design as an example and applies statistical methods such as the log-rank test to analyze large volumes of medical records automatically. After the automated analysis, the system presents the results on a web client through several visualization tools, providing an easy interface for validating and browsing the results and the potential knowledge they contain. Using the relationship between hepatitis C and interferon as an example, the study derives case-control results for every related disease among nearly 1,300 hepatitis C patients treated with interferon and 14,000 hepatitis C patients not treated with interferon, including the log-rank test p-value and the Kaplan-Meier curve for each. Visualizations such as Sankey diagrams and treemaps provide a complete knowledge-browsing system, so that researchers can gain a better understanding of the dataset before they dive into it. Historical medical big datasets may need extensive preprocessing before analysis, so distributed computing plays an important role in mining this kind of data. Future work includes applying other statistical methods and data mining algorithms within the automated analysis process, and creating an intermediate knowledge database for further analysis and future integration.
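The abstract names the Kaplan-Meier curve and the log-rank test as the per-disease outputs but does not show how they are computed. The following is a minimal sketch of the textbook two-group versions of both, not the study's own implementation. The function names `kaplan_meier` and `logrank_test` and the toy follow-up times are assumptions made only for illustration; they are not study data.

```python
import math
from collections import Counter


def kaplan_meier(times, events):
    """Kaplan-Meier estimate: list of (time, survival) at each event time.

    times  -- follow-up time for each patient
    events -- 1 if the diagnosis was observed at that time, 0 if censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)          # everyone leaving the risk set at t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test: returns (chi-square statistic, p-value)."""
    times_a, times_b = list(times_a), list(times_b)
    events_a, events_b = list(events_a), list(events_b)
    all_pairs = zip(times_a + times_b, events_a + events_b)
    event_times = sorted({t for t, e in all_pairs if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)   # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)   # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if e and x == t)
        d_b = sum(1 for x, e in zip(times_b, events_b) if e and x == t)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # hypergeometric variance of the event count in group A
            variance += n_a * n_b * d * (n - d) / (n * n * (n - 1))

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Toy data (hypothetical): months until a diagnosis, 1 = observed.
    treated_t, treated_e = [6, 9, 12, 15, 20], [1, 0, 1, 0, 0]
    untreated_t, untreated_e = [2, 3, 5, 8, 11], [1, 1, 1, 1, 0]
    print(kaplan_meier(treated_t, treated_e))
    print(logrank_test(treated_t, treated_e, untreated_t, untreated_e))
```

An automated pipeline of the kind described would presumably repeat this comparison once per diagnosed disease and keep each p-value and curve for the visualization layer. At the scale of the real dataset, a vectorized or distributed implementation would be more appropriate than these plain loops.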

Parallel Keywords

Medical big data; Medical knowledge discovery; Data mining; Data visualization; Case-control study

