
Towards Semantic Matching and Self-supervised Learning for Patient-trial Matching

Advisor: 許永真

Abstract


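Building an automated patient-trial matching system makes patient-trial matching more efficient: it saves medical staff valuable time and lets them obtain suitable clinical trial information more quickly, as a reference option for treatment plans.

Patient-trial matching searches, based on a patient's condition (the medical case), for clinical trials whose eligibility criteria the patient meets. Because eligibility criteria are written as unstructured natural language and carry the complex semantics of clinical medical knowledge, most earlier work treated the task as an automatic information extraction problem and solved it with rule-based methods; more recent work uses machine learning or neural network models. All of these methods, however, require either extensive manual effort or large amounts of manually annotated data.

We propose an iterative approach to patient-trial matching. In the early research phase, when annotated data are lacking, we match at the sentence level by semantic similarity and build a complete matching pipeline around this core module, consisting of text preprocessing, text vector representation, sentence-pair matching, negation detection, a numerical filter, and aggregated ranking. All matching results are manually evaluated and annotated by medical experts. Our analysis of the annotations shows that the assumption that semantic similarity implies semantic relevance, or even a match, does not fully hold for the texts in this study. In the later research phase we therefore try to address this problem with self-supervised learning.

We link three sources, namely the annotation data, the patient case data, and the trial criteria, and generate pseudo-annotated datasets for specific clinical feature terms. These datasets are used to train classification models that decide semantic polarity (match or no match), and the classifiers are fed back into the overall workflow. Using the previous round's matching results, the three steps "annotate, generate, train" produce a classifier, which in turn produces the next round's matching results; repeating this iteration gradually improves overall matching performance.

We ran two rounds of this iterative workflow on three batches of patient cases in sequence. In every case, introducing the classifier raised precision by 15%, improving the matching performance of the overall pipeline and confirming that the iterative method and matching pipeline of this study are feasible and effective.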

Parallel Abstract


We have developed an automated patient-trial matching system to improve the efficiency of clinical trial matching. The system can help medical staff save valuable time and quickly obtain appropriate clinical trial information as a reference for treatment plans.

Patient-trial matching finds clinical trials whose eligibility criteria are met by the patient's condition (medical records). The eligibility criteria are all expressed in natural language and contain the complex semantics of clinical medical knowledge. Most earlier work defined the task as an automatic information extraction problem and solved it with rule-based methods or machine learning methods. However, these methods require many experts to annotate the data manually.

We propose an iterative method to match patient cases and clinical trials. In the first phase of the research, because annotated data were lacking, we used semantic-based methods to match at the sentence level. We built a matching system that includes modules for text preprocessing, text vector representation, sentence pair matching, negation detection, numerical filtering, and aggregation. The matching results are manually evaluated and annotated by medical experts. Based on our analysis of the annotation results, we found that the assumption that semantic similarity equals semantic relevance does not fully apply to the texts in our research. Therefore, in the second research phase, we use self-supervised learning methods to address this problem.

We associate the annotation data, the medical case data, and the trial criteria to generate pseudo-annotated datasets for specific clinical features, train semantic classification models on them, and import the classifiers into our workflow. Using the matching results of the previous round, the classifier is generated through the three steps of "labeling-generating-training", and then the matching results of the next round are generated. Such repeated iterations can gradually improve the overall matching performance.

We performed two rounds of this iterative workflow on three batches of patient cases in sequence. After importing the classifier, the precision increased by 15% and the matching performance of the whole process improved, which verified the feasibility and effectiveness of the iterative method and matching process of this research.
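The observation that semantic similarity does not always mean a match can be illustrated with a small sketch. This is not the system described in the abstract: a bag-of-words count vector stands in for a learned sentence embedding, and the `NEGATION_CUES` list and all function names are hypothetical, chosen only for illustration. The sketch ranks criterion sentences by cosine similarity to a patient sentence and flags pairs whose negation status differs.

```python
import math
import re
from collections import Counter

# Hypothetical cue list for illustration; real negation lexicons are far larger.
NEGATION_CUES = {"no", "not", "without", "denies"}


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def embed(sentence):
    """Bag-of-words counts, an assumed stand-in for a sentence embedding."""
    return Counter(tokenize(sentence))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_negated(sentence):
    """True if the sentence contains any negation cue."""
    return any(tok in NEGATION_CUES for tok in tokenize(sentence))


def rank_criteria(patient_sentence, criteria):
    """Return (score, polarity_conflict, criterion) tuples, best score first."""
    p_vec = embed(patient_sentence)
    p_neg = is_negated(patient_sentence)
    ranked = [
        (cosine(p_vec, embed(c)), p_neg != is_negated(c), c)
        for c in criteria
    ]
    return sorted(ranked, key=lambda r: r[0], reverse=True)


if __name__ == "__main__":
    patient = "Patient has no history of diabetes"
    criteria = [
        "History of diabetes",
        "History of heart failure",
        "Age 18 years or older",
    ]
    for score, conflict, criterion in rank_criteria(patient, criteria):
        print(f"{score:.3f}  conflict={conflict}  {criterion}")
```

In this toy input, "History of diabetes" is the most similar criterion to the patient sentence even though the patient sentence negates it, which is the kind of case a polarity check or a trained classifier is meant to catch.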

