
Using Machine Learning Methods to Predict the Pathogenicity of Copy Number Variation

Advisor: 賴飛羆 (Feipei Lai)
Co-advisor: 李妮鍾 (Ni-Chung Lee)

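Abstract

The recent spread of next-generation sequencing has made rapid sequencing of the human genome feasible. At the same time, thanks to the lower cost of sequencing consumables, using sequencing data for clinical diagnosis, especially the diagnosis of hereditary diseases, is no longer out of reach. A patient's DNA is converted into sequence data by a sequencer, and after several bioinformatics algorithms are run in sequence, information such as the positions and copy numbers of copy number variations (CNVs) in the exome becomes available. Traditionally, physicians must manually look up reports on many CNVs at different positions across multiple genetic databases, and only after browsing a huge amount of data can they identify the few CNVs that are pathogenic and match the patient's symptoms. This laborious process often costs physicians considerable time and effort in interpreting hereditary diseases, adding a further burden to an already heavy clinical workload.

The goal of this study is to use machine learning methods to build a CNV pathogenicity prediction model that can be applied to patients' exome data, thereby helping physicians interpret CNV data generated by next-generation sequencing faster and more accurately. We obtained CNV data with pathogenicity interpretations and symptom descriptions from the public ClinVar database and combined them with CNV data from healthy individuals in the public International Genome Sample Resource (IGSR) to form the training data for the machine learning model. The model features are the annotations that multiple genetic databases provide for each CNV, together with the correlation between phenotype keywords and CNVs. The trained model predicts the pathogenicity of each CNV and outputs the CNVs sorted from highest to lowest pathogenicity.

The whole-exome sequencing data used for testing come from 28 patients at National Taiwan University Hospital who have congenital hereditary rare diseases associated with CNVs. We also obtained phenotype keywords annotated by physicians to compute the correlation between the phenotypes and each CNV. The 28 patients have 31 pathogenic CNVs in total, and each patient has about 452 candidate CNVs on average. After ranking all CNVs of each patient separately, the trained model placed 83.9% of the pathogenic CNVs in the top ten. The results show that the model can assist physicians and researchers, effectively shortening diagnosis time so that CNV-related diseases can be treated earlier.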

Parallel Abstract


Over the last decade, the widespread adoption of next-generation sequencing (NGS) technology has made rapid sequencing of a whole human genome feasible. The falling cost of sequencing consumables has also led to wider use of NGS in clinical diagnosis, especially for patients with hereditary disorders. A patient's DNA sequencing data are processed with specialized instruments and bioinformatics pipelines to produce exome-level copy number variation (CNV) data, such as CNV positions and copy numbers. Typically, to find the one or few causal CNVs behind a patient's genetic disease, physicians must manually review the metadata of the CNVs the patient carries and consult several genetic databases to assess their effects. This time-consuming process imposes a heavy workload and becomes a burden in clinical practice.

To help physicians interpret CNV information from NGS results quickly, we trained a machine learning model to predict the pathogenicity of CNVs in exome data. We collected CNV data annotated with pathogenicity and phenotype descriptions from the public database ClinVar, and CNV data from healthy individuals from the public International Genome Sample Resource (IGSR). The ClinVar and IGSR data were combined to form the training data for our model, called AI CNV Prioritizer. As model features, we used annotations of these CNVs from several gene databases together with a correlation score between each CNV and the phenotype. The trained model predicts a pathogenicity score for each CNV, and the CNVs are then sorted by predicted score so that the most likely causative ones rank first.

The test data are whole-exome sequencing data from 28 patients admitted to National Taiwan University Hospital (NTUH). We used phenotype keywords provided by physicians to calculate the correlation score between phenotype and CNVs. These 28 patients carry 31 causative CNVs in total, and each patient has 452 candidate CNVs on average. The trained model ranked 83.9% of the causative CNVs within the top 10 of each patient's list. The model can therefore support the diagnosis of genetic diseases, enabling earlier treatment of CNV-related disorders.
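The per-patient ranking and the top-10 figure reported above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `rank_cnvs` and `top_k_hit_rate`, the patient and CNV identifiers, and the scores are all hypothetical toy values chosen for illustration. The sketch assumes the model has already produced one pathogenicity score per candidate CNV.

```python
from collections import defaultdict


def rank_cnvs(scored):
    """Group (patient, cnv_id, score) rows by patient and sort each
    patient's CNVs from highest to lowest predicted score."""
    by_patient = defaultdict(list)
    for patient, cnv_id, score in scored:
        by_patient[patient].append((cnv_id, score))
    return {
        patient: [cnv for cnv, _ in sorted(items, key=lambda x: x[1], reverse=True)]
        for patient, items in by_patient.items()
    }


def top_k_hit_rate(rankings, causative, k=10):
    """Fraction of known causative CNVs that appear within the top k
    of their own patient's ranked list."""
    hits = 0
    total = 0
    for patient, cnv_ids in causative.items():
        top = set(rankings.get(patient, [])[:k])
        for cnv in cnv_ids:
            total += 1
            hits += cnv in top
    return hits / total if total else 0.0


# Hypothetical toy data: two patients with three candidate CNVs each.
scored = [
    ("P1", "cnvA", 0.91), ("P1", "cnvB", 0.15), ("P1", "cnvC", 0.40),
    ("P2", "cnvD", 0.20), ("P2", "cnvE", 0.75), ("P2", "cnvF", 0.60),
]
# Hypothetical ground truth: the CNVs a physician confirmed as causative.
causative = {"P1": {"cnvA"}, "P2": {"cnvD", "cnvE"}}

rankings = rank_cnvs(scored)
rate_at_2 = top_k_hit_rate(rankings, causative, k=2)
print(f"top-2 hit rate: {rate_at_2:.3f}")
```

With 31 causative CNVs across 28 patients, the reported 83.9% is consistent with 26 of the 31 falling within the top 10 (26/31 ≈ 0.839), although the abstract does not state the count itself.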
