
應用機器學習搭配資料庫註解預測突變點之致病性

Predicting pathogenicity of variants using machine learning with database annotation

Advisor: 歐陽彥正
Co-advisor: 陳倩瑜 (Chien-Yu Chen)
This thesis will be available for download on 2026/01/26.

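Abstract


As next-generation sequencing (NGS) technology has matured and its cost has fallen, clinical use of genetic testing has grown rapidly, and the usefulness of a genetic test depends on whether variants can be accurately classified by pathogenicity. To address this problem, the ACMG guideline proposed in 2015 adjusts its criteria and scoring by disease and classifies variants into five categories: pathogenic, likely pathogenic, uncertain significance, likely benign, and benign. Under the ACMG guideline, the database information needed to evaluate most criteria can be obtained with variant annotation tools such as ANNOVAR, and the parameters of the ACMG criteria can then be tuned by disease to automate interpretation. In recent years, the accumulation of large amounts of sequencing data has brought with it a large number of variant annotations. The ClinVar database (as of 2020) has accumulated roughly 800,000 variants associated with disease phenotypes; among them, confirmed clinical evidence is limited or may be contradictory, and only about 12,000 variants have been reviewed by ClinGen experts. Recent tools such as LEAP simulate how experts interpret variants under the ACMG guideline, using database annotations such as various pathogenicity prediction scores and allele frequencies as features of a random forest. These tools use neither pathogenicity prediction scores derived purely from evolutionary conservation (such as CADD) nor pathogenicity prediction scores based on chromatin regulation of sequences (such as DeepSEA) as model features. This thesis develops a variant annotation system distinct from ANNOVAR and similar methods and uses the ClinVar expert-reviewed dataset as training data. Taking the two genes BRCA1/2 as an example, it adds comprehensive pathogenicity prediction databases such as CADD and SpliceAI, and uses the scores that the DeepSEA model predicts for variants as features, to train the V-score random forest model, which outputs the probability that a variant is pathogenic. With the ClinVar reclassification dataset as test data, accuracy reaches 98.48%, which can effectively reduce the rate of variants of uncertain significance (VUS rate); in a comparison by AUC with 22 other prediction methods, the model also reaches 98%. Finally, the choice of confidence thresholds for variant interpretation is discussed. These results and the experience accumulated are expected to be of great help in assessing the pathogenicity of variants.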
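The AUC used above to compare predictors can be read as the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The following is a minimal Python sketch under that definition; the function name `auc` and the toy scores are hypothetical illustrations and are not taken from the thesis.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted pathogenic probabilities (higher = more pathogenic).
    labels: 1 for pathogenic, 0 for benign.
    A tie between a pathogenic and a benign score counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the thesis).
toy_scores = [0.95, 0.80, 0.40, 0.30, 0.10]
toy_labels = [1, 1, 0, 1, 0]
print(auc(toy_scores, toy_labels))
```

The pairwise form is quadratic in the number of variants, which is acceptable for a small illustration; a rank-based computation would likely be preferred on a full dataset.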

Parallel Abstract


With advances in next-generation sequencing (NGS), the cost of sequencing has fallen dramatically. Meanwhile, clinical use of genetic tests has grown rapidly, and the practical value of a genetic test depends on accurately classifying the pathogenicity of variants. To address this problem, the ACMG guideline was proposed in 2015; it provides criteria and scoring functions, adjustable by disease, that sum the available evidence. Under the guideline, variants are classified as Pathogenic, Likely pathogenic, Variant of uncertain significance (VUS), Likely benign, or Benign. Variant annotation tools such as ANNOVAR can retrieve the features the criteria require from bioinformatic databases, and the parameters of the ACMG criteria can then be tuned to automate ACMG interpretation. In the past few years, the large volume of sequencing data produced has led to a sharp increase in newly discovered variants. According to ClinVar statistics, nearly 800,000 variants had accumulated by 2020. Among them, confirmed clinical evidence is limited or may be contradictory, and only about 12,000 variants have been reviewed by the ClinGen expert panel. Recent tools such as LEAP simulate expert behavior to classify variants based on the ACMG guideline. LEAP uses database annotations, such as scores from various pathogenicity predictors and population allele frequencies, as features to train its prediction models. Notably, LEAP does not use pathogenicity predictors built on conservation data (e.g. CADD) or pathogenicity scores trained on sequence regulation (e.g. DeepSEA). This study therefore develops a variant annotation system and evaluates it on BRCA1/2 variants in the ClinVar database.
The proposed system enhances annotation efficiency and integrates predictors including CADD and SpliceAI, together with scores from other predictors such as the chromatin effect score from DeepSEA, as features to train a random forest model, V-score, which outputs a pathogenic probability. A reclassification dataset from ClinVar is used to examine how well the model reduces the VUS rate. The V-score model achieved an accuracy of 98.48% and an AUC of 98%, outperforming 22 other functional prediction methods. Finally, the reclassification dataset is used as an example to discuss pathogenic-probability thresholds for variant classification with the proposed system. The V-score model and the experience accumulated in this study are expected to benefit future research on variant classification.
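The trade-off behind the probability thresholds discussed above can be illustrated with a small Python sketch: variants whose pathogenic probability falls between two cut-offs remain VUS, so stricter cut-offs tend to raise the VUS rate while improving accuracy on the variants that are called. The function names, the cut-off values (0.1 and 0.9), and the toy data are all hypothetical and are not taken from the thesis.

```python
def classify(prob, benign_max=0.1, pathogenic_min=0.9):
    """Map a pathogenic probability to a call (cut-offs are hypothetical)."""
    if prob >= pathogenic_min:
        return "pathogenic"
    if prob <= benign_max:
        return "benign"
    return "VUS"


def summarize(probs, truths, benign_max=0.1, pathogenic_min=0.9):
    """Return (VUS rate, accuracy among the non-VUS calls)."""
    if not probs:
        raise ValueError("no variants to summarize")
    calls = [classify(p, benign_max, pathogenic_min) for p in probs]
    vus = sum(1 for c in calls if c == "VUS")
    called = [(c, t) for c, t in zip(calls, truths) if c != "VUS"]
    correct = sum(1 for c, t in called if c == t)
    vus_rate = vus / len(calls)
    # Accuracy is undefined when every variant is left as VUS.
    accuracy = correct / len(called) if called else float("nan")
    return vus_rate, accuracy


# Toy data (hypothetical, not from the thesis).
toy_probs = [0.97, 0.92, 0.55, 0.08, 0.03, 0.95]
toy_truths = ["pathogenic", "pathogenic", "benign",
              "benign", "benign", "benign"]

# Looser cut-offs call more variants; stricter ones leave more as VUS.
print(summarize(toy_probs, toy_truths))
print(summarize(toy_probs, toy_truths, benign_max=0.05, pathogenic_min=0.96))
```

On this toy data the stricter cut-offs leave more variants as VUS but make no wrong calls, which is the kind of balance a confidence threshold has to strike.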

