  • Theses and dissertations

從臨床文字報告到復發預測模組 以肝癌患者為研究對象進行資訊擷取、資料查詢與探勘

From Clinical Narrative Reports to Recurrence Predictive Models: Information Extraction, Data Query, and Data Mining for Liver Cancer Patients

Advisor: 賴飛羆

Abstract


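According to global cancer statistics published in 2011, liver cancer is the second leading cause of cancer death in men and the sixth leading cause in women. The main goal of this study is to build post-treatment recurrence predictive models for liver cancer patients treated with radiofrequency ablation, based on clinical narrative reports and structured clinical data. With the widespread adoption of electronic medical record systems, clinical narrative reports have accumulated in clinical data repositories. How to enable medical personnel to efficiently track the many types of narrative reports accumulated for a patient over a long period, and how to help them query the large volume of structured data in the clinical data repository, have both become important issues for facilitating clinical research. Analyses of medical numerical data often face the problem of missing values; the same problem arose while developing the recurrence predictive models for liver cancer patients and requires further investigation.

This study developed information extraction methods to extract liver-cancer-related clinical factors from different types of narrative reports, including ultrasound reports, radiology imaging reports, pathology reports, operation notes, admission notes, and discharge summaries. The information extraction module can be applied to track changes in a liver cancer patient's condition, and the rule-based classifier can be applied to determine whether a patient meets research-related criteria. The classifier provides an answer on whether the patient meets a specific criterion, together with direct and indirect supporting evidence for the classification (i.e., relevant statements extracted from the narrative reports). To help medical personnel query structured data, this study adopts an ontology-driven clinical guideline representation language to describe query tasks in a structured way. The query tasks are represented in the Protege environment using GLIF3.5 components, which divide a query task into multiple steps and flows. This study implemented an execution engine that reads query tasks expressed with GLIF3.5 components, queries the corresponding data, and displays the results. Simulation experiments with different numbers of patients were used to analyze the correctness of the engine's data queries and the execution time required. To develop predictive models in the presence of incomplete data, seven imputation methods were used in simulation experiments to analyze and evaluate the stability and accuracy of each method in this study; the better-performing settings were then used to impute the data sets generated with different sampling time periods for data integration. Support vector machines were used as the classifier to build recurrence predictive models on the different data sets, in order to evaluate the influence of the imputation methods on the predictive models. This study introduces a two-stage evaluation method for imputation methods and predictive models: the first stage evaluates the performance of the imputation methods, and the second stage evaluates the influence of the imputation methods on the predictive models.

The information extraction module achieved F-scores from 92.40% to 99.59%, and the classifier for determining patient criteria achieved F-scores from 96.15% to 100%. The query engine implemented on the basis of GLIF3.5 correctly retrieved and presented the corresponding data in simulation experiments with different numbers of patients, and the adoption of GLIF3.5 increases the potential for interoperability, shareability, and reusability of future query tasks. In the two-stage evaluation, the first stage selected, for each imputation method, the parameter settings that were more stable and accurate, and the second stage identified the combinations of imputation method and sampling time period that yielded the better-performing predictive models. The experimental results show that these better-performing imputation methods differ significantly (P < 0.001) from the commonly used mean imputation.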
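
The first evaluation stage described above, in which an imputation method is scored in a simulation, could be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the function names, the masking rate, the RMSE metric, the median imputer, and the sample values are all assumptions made for the example. Only mean imputation is named in the abstract.

```python
import math
import random
import statistics


def mean_impute(column):
    """Replace None entries with the mean of the observed entries."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]


def median_impute(column):
    """Replace None entries with the median of the observed entries
    (a hypothetical alternative, chosen only for comparison)."""
    observed = [v for v in column if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in column]


def imputation_rmse(complete, impute, missing_rate=0.3, seed=0):
    """Hide a random subset of known values, impute them, and return
    the root-mean-square error on the hidden positions."""
    rng = random.Random(seed)
    n_hidden = max(1, int(len(complete) * missing_rate))
    hidden = set(rng.sample(range(len(complete)), n_hidden))
    masked = [None if i in hidden else v for i, v in enumerate(complete)]
    filled = impute(masked)
    squared = [(filled[i] - complete[i]) ** 2 for i in hidden]
    return math.sqrt(sum(squared) / len(squared))


# Made-up laboratory values, for illustration only.
values = [32.0, 41.0, 38.0, 120.0, 35.0, 44.0, 39.0, 36.0, 40.0, 37.0]
for name, method in [("mean", mean_impute), ("median", median_impute)]:
    print(name, round(imputation_rmse(values, method), 3))
```

Repeating the masking with different seeds would give a rough sense of how stable each method is. The second stage would then train a classifier on each imputed data set and compare the resulting predictions.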

Parallel Abstract


Background: According to global cancer statistics from 2011, liver cancer was the second most frequent cause of cancer death in men and the sixth leading cause of cancer death in women. The goal of this work is to develop recurrence predictive models for patients who have received radiofrequency ablation (RFA) treatment, based on clinical narrative reports and structured data sources.

Introduction: As a result of the increasing adoption of electronic medical record (EMR) systems, more and more medical records are accumulating in clinical data repositories. To track patients’ conditions efficiently over long periods of time and to facilitate the collection of clinical data from different types of narrative reports, an efficient method for analyzing the clinical data accumulated in narrative reports is needed. For structured data, querying the clinical data repository is becoming increasingly important for discovering the knowledge contained in large volumes of data. Missing data occur frequently in medical research, and the data used here to develop a liver cancer recurrence predictive model are no exception, so methods for handling missing data are necessary. In this study, several imputation methods and their effects on multiple-measurement data sets with different sampling time periods are compared.

Materials and Methods: To facilitate liver cancer clinical research, a method is developed for extracting clinical factors from mixed types of narrative clinical reports, including ultrasound reports, radiology reports, pathology reports, operation notes, admission notes, and discharge summaries. An information extraction (IE) module is developed for tracking a liver cancer patient’s disease progression over time, and a rule-based classifier is developed for determining whether patients meet the eligibility criteria for clinical research. The classifier provides answers to the clinical questions together with direct and indirect evidence (i.e., evidence sentences). The ontology-driven clinical guideline representation language, guideline interchange format version 3.5 (GLIF3.5), is used to formulate query tasks. The query tasks are formulated as GLIF3.5 flowcharts in the environment provided by Protege. The query execution engine of the Flowchart-Based Data-Querying Model (FBDQM) is developed and implemented to execute the query tasks and present the query results in a graphical interface. The correctness and execution time of the system are evaluated in experiments using three medical query tasks relevant to liver cancer. To develop predictive models from incomplete clinical data, several imputation methods are adopted to handle the missing data before data analysis. A support vector machine (SVM) is employed to build the recurrence predictive model. This study introduces a two-level method for evaluating imputation methods and predictive models when data are missing. The first level evaluates the performance of an imputation method, and the second level evaluates the influence of imputation methods on predictive models.

Results: The IE model achieves F-scores from 92.40% to 99.59%, and the classifier achieves accuracy from 96.15% to 100%. The FBDQM-based query execution engine successfully retrieves the clinical data for query tasks formulated in GLIF3.5 in experiments with different numbers of patients. All three query tasks are 100% correct in the four experiments. For each imputation method, more appropriate parameter settings for a specific data set can be selected based on the imputation simulation experiment. The results reveal that appropriate combinations of imputation method and sampling time period achieve better classification results than other combinations. According to the evaluation results, the leading imputation methods differ significantly (P < 0.001) from mean imputation, which is frequently applied to data sets with missing values.

Conclusions: The method is successfully applied to mixed types of narrative clinical reports and might also be applied to extracting key information for patients with other types of cancer. The ontology-driven, FBDQM-based approach enriches the capability of data query. The adoption of GLIF3.5 increases the potential for interoperability, shareability, and reusability of the query tasks. With the two-level evaluation approach, imputation methods and their effects on different multiple-measurement data sets can be explored for the classification of liver cancer recurrence.
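
The idea of a rule-based classifier that returns both an answer and its evidence sentences, as described in the Materials and Methods, could be sketched as follows. This is only an illustration under stated assumptions: the eligibility criterion (whether a report mentions RFA), the regular expressions, the crude sentence-level negation check, and the sample report text are all hypothetical and are not taken from the thesis.

```python
import re

# Hypothetical rule: a sentence that mentions RFA without a negation cue
# is treated as direct evidence that the patient received RFA.
RFA_PATTERN = re.compile(r"\b(radiofrequency ablation|RFA)\b", re.IGNORECASE)
NEGATION_PATTERN = re.compile(r"\b(no|not|without|denies)\b", re.IGNORECASE)


def split_sentences(report):
    """Split a report into sentences at ., !, or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", report)
    return [s.strip() for s in parts if s.strip()]


def classify_rfa(report):
    """Return the answer for the criterion plus the supporting sentences."""
    evidence = [
        sentence
        for sentence in split_sentences(report)
        if RFA_PATTERN.search(sentence) and not NEGATION_PATTERN.search(sentence)
    ]
    return {"meets_criterion": bool(evidence), "evidence": evidence}


# Made-up report text, for illustration only.
report = (
    "A 2.1 cm hepatic tumor was noted in S6. "
    "The patient underwent RFA on the same day. "
    "No ascites."
)
result = classify_rfa(report)
print(result["meets_criterion"])
for sentence in result["evidence"]:
    print("evidence:", sentence)
```

Returning the matched sentences alongside the answer lets a reviewer check each decision against the source report, which is the role the abstract assigns to evidence sentences.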

