  • Thesis

Evaluating the performance of ezGeno in analyzing transcription factor binding profiles and applying it to a comparative study across cell types

Advisor: 陳倩瑜

Abstract


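Among research topics in gene expression, the interaction between transcription factors (TFs) and their binding sites has long attracted attention. How a TF recognizes and binds specific sites in the genome, thereby regulating downstream gene expression and ultimately affecting biological behavior, is an important question for bioinformaticians. This thesis focuses on how TF binding sites differ across cell lines. Chromatin immunoprecipitation sequencing (ChIP-seq) data for several TFs in different cell lines were collected and analyzed with deep learning tools. Because the most suitable deep network architecture differs from one TF to another, this study uses the automated machine learning (AutoML) tool ezGeno to speed up the construction of prediction models for different TFs in different cell lines, and applies the trained models to finding variants that may affect TF binding.
This thesis uses ezGeno, developed jointly by our laboratory and Taiwan AI Labs. The tool first selects a suitable convolutional neural network (CNN) architecture through AutoML and then predicts TF binding sites. The analysis relies mainly on ChIP-seq data from the ENCODE database. To evaluate how many positive samples ezGeno needs for training, two datasets were collected: the first consists of 10 randomly selected TFs in the K562 cell line, and the second consists of 2 TFs in five cell lines. For each dataset, different numbers of peaks were taken as positive samples while the number of positive samples in the test data was held fixed. The results show that prediction performance stabilizes once the number of positive samples exceeds 1,000. For the cross-cell-line analysis of TF binding, this study uses 24 TFs in the five most common cell lines in the database, takes the same number of positive samples for each, and analyzes prediction accuracy. The sequence segments that ezGeno considered important were analyzed for sequence motifs with MEME and compared against the JASPAR database, which showed that the models learned additional features besides the primary binding motif. Models built from the same data were stable, whereas models for different TFs or different cell lines differed because of differences in binding characteristics. Finally, the trained models were applied to predicting the effect of single nucleotide variants on TF binding, using variant data from breast, liver, and lung tissue. Across the p-value thresholds tested, the number of significant variants among the positive samples exceeded that among the negative samples in all three tissues, indicating that the models are feasible for finding variants that may affect TF binding. In summary, this study uses the AutoML tool ezGeno to build binding site prediction models for TFs in different cell lines efficiently, substantially accelerating the application of deep learning to research on transcriptional regulation.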
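As an illustration of how ChIP-seq peak sequences are commonly prepared as CNN input, the following minimal sketch one-hot encodes a DNA sequence. The abstract does not describe ezGeno's preprocessing, so the function name `one_hot`, the A/C/G/T column order, and the all-zero treatment of unknown bases are assumptions made for this example only.

```python
BASES = "ACGT"  # assumed column order; not taken from the thesis


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    index = {base: i for i, base in enumerate(BASES)}
    matrix = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        if base in index:
            row[index[base]] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    # Each row marks which of A, C, G, T occupies that position.
    for row in one_hot("ACgN"):
        print(row)
```

A positive sample would be the encoded sequence under a peak, and a negative sample an encoded sequence drawn from elsewhere; how the thesis selects negatives is not stated here.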

Parallel Abstract


In gene expression studies, the interactions between transcription factors (TFs) and their binding sites have long been of great interest to bioinformaticians. This thesis compares the binding behavior of TFs across cell lines by collecting chromatin immunoprecipitation sequencing (ChIP-seq) data for several TFs in different cell lines and analyzing them with deep learning models. Different TFs, however, require careful tuning of the network architecture to reach satisfactory performance, which complicates the analysis. For this reason, an AutoML tool, ezGeno, was used to construct models for predicting binding specificity, and the resulting prediction models were then used to analyze the effect of sequence variation on TF binding. This thesis uses ezGeno to build deep CNN models for predicting TF binding sites automatically. The ChIP-seq data were downloaded from the ENCODE database. To evaluate the performance of ezGeno, we randomly selected 10 TFs from the K562 cell line as the first dataset and 2 TFs from 5 cell lines as the second, and then extracted different numbers of peaks to build and test the models. We found that a certain minimum number of positive samples is sufficient to obtain satisfactory prediction performance, although larger numbers of sequences gave slightly better predictions. For the cross-cell-type comparison, we downloaded ChIP-seq data for 24 TFs from five primary cell types and used the same amount of data as positive samples for each. We analyzed the prediction performance for the 24 TFs in the five cell types, used MEME to identify sequence motifs in the subsequences highlighted by ezGeno, and compared these motifs with the JASPAR database.
In addition, we found that the model architectures selected by ezGeno are usually stable for a given dataset, while the models differ among TFs and cell lines because of differences in binding characteristics. Finally, the prediction models were applied to predicting the effect of single nucleotide variants on binding, using variants that affect gene expression in breast, liver, and lung. A two-tailed paired-sample t-test was used to calculate the significance (p-value) of the difference between reference and alternative sequences. In all three tissues, the number of significant variants in the positive variant list was higher than in the negative list, indicating the feasibility of this analysis method. In the future, the models can be used to identify variants that cause abnormal TF binding and thus affect gene expression. In summary, this study demonstrates that ezGeno can accelerate the construction of TF binding models and thereby facilitate the study of how sequence variants alter TF binding.
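The variant analysis described above can be sketched roughly as follows: build the alternative sequence from the reference, score both, and compare the paired scores with a paired-sample t-test. This is only a sketch under stated assumptions. The helper names `apply_snv` and `paired_t_statistic` are hypothetical, and the abstract does not say what forms the pairs, so the example assumes several paired prediction scores per variant (for instance from repeatedly trained models).

```python
import math
from statistics import mean, stdev


def apply_snv(sequence, position, ref, alt):
    """Return the alternative sequence for a single nucleotide variant.

    `position` is 0-based (an assumption of this sketch). The reference
    allele is checked so a mismatched coordinate fails loudly.
    """
    if sequence[position] != ref:
        raise ValueError("reference allele does not match the sequence")
    return sequence[:position] + alt + sequence[position + 1:]


def paired_t_statistic(ref_scores, alt_scores):
    """Return (t, degrees_of_freedom) for a paired-sample t-test.

    t = mean(d) / (sd(d) / sqrt(n)), where d holds the per-pair
    differences alt - ref and sd is the sample standard deviation.
    """
    if len(ref_scores) != len(alt_scores) or len(ref_scores) < 2:
        raise ValueError("need at least two equally sized paired scores")
    diffs = [a - r for r, a in zip(ref_scores, alt_scores)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    return mean(diffs) / (spread / math.sqrt(n)), n - 1


if __name__ == "__main__":
    alt_seq = apply_snv("ACGT", 2, "G", "T")
    # Hypothetical binding scores for the reference and alternative sequence.
    t, df = paired_t_statistic([0.9, 0.8, 0.7], [0.6, 0.4, 0.5])
    print(alt_seq, round(t, 3), df)
```

Turning the statistic into a two-tailed p-value requires the Student t distribution with `df` degrees of freedom; a statistics library routine such as `scipy.stats.ttest_rel` performs the whole test in one call.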

References


Alipanahi, B., Delong, A., Weirauch, M. T., Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat Biotechnol, 33(8), 831-838. https://doi.org/10.1038/nbt.3300
Avsec, Ž., Agarwal, V., Visentin, D., Ledsam, J. R., Grabska-Barwinska, A., Taylor, K. R., Assael, Y., Jumper, J., Kohli, P., Kelley, D. R. (2021). Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods, 18(10), 1196-1203. https://doi.org/10.1038/s41592-021-01252-x
Bailey, T. L., Johnson, J., Grant, C. E., Noble, W. S. (2015). The MEME Suite. Nucleic Acids Res, 43(W1), W39-49. https://doi.org/10.1093/nar/gkv416
Castro-Mondragon, J. A., Riudavets-Puig, R., Rauluseviciute, I., Berhanu Lemma, R., Turchi, L., Blanc-Mathieu, R., Lucas, J., Boddie, P., Khan, A., Manosalva Pérez, N., Fornes, O., Leung, T. Y., Aguirre, A., Hammal, F., Schmelter, D., Baranasic, D., Ballester, B., Sandelin, A., Lenhard, B., . . . Mathelier, A. (2021). JASPAR 2022: the 9th release of the open-access database of transcription factor binding profiles. Nucleic Acids Res, 50(D1), D165-D173. https://doi.org/10.1093/nar/gkab1113
ENCODE Project Consortium. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57-74. https://doi.org/10.1038/nature11247
