
Using deep learning to predict antigen binding specificity of T-cell receptors

Advisor: 陳倩瑜

Abstract


Predicting the interaction between T cell receptors (TCRs), major histocompatibility complexes (MHC), and peptides remains a highly challenging computational problem. The challenge stems mainly from three factors: the accuracy of experimental data, data scarcity, and the intrinsic complexity of the problem itself. One of the fundamental unresolved questions in neoantigen and antigen biology is why not all neoantigens or antigens elicit a T cell response; accurate prediction of the interactions between neoantigens/antigens and TCRs would therefore be crucial for studies of cancer progression, prognosis, and response to immunotherapy. On the other hand, many recent studies in natural language processing (NLP) have shown that protein sequences can be treated as sentences and amino acids as words, and a growing number of studies have begun to apply NLP-like techniques to extract useful biological information from protein sequence databases. Recently, several pre-trained protein language models have been publicly released and have proven helpful for a variety of downstream prediction tasks. This study therefore built a prediction model that uses the protein language model ProtBert as its encoder to predict the TCR binding specificity of neoantigens and general T cell antigens presented by class I major histocompatibility complexes. Two prediction problems were addressed: MHC-I-peptide binding and TCR-peptide-MHC (pMHC) binding. A comparison of different encoding schemes shows that the protein language model improves prediction accuracy on both problems. Finally, this study combines the ProtBert-based model with ensemble learning to further improve its accuracy, with the aim of strengthening downstream applications of predicting the binding specificity between T cell receptors and antigens.
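
The abstract describes the encoding step only at a high level. The following is a minimal sketch, not the thesis code, of how a peptide or TCR CDR3 sequence might be embedded with the publicly released ProtBert model; the model name Rostlab/prot_bert and its preprocessing convention are from the ProtTrans release, while the mean pooling, the embed() helper, and the example sequences are illustrative assumptions.

    # Minimal sketch: embedding an amino-acid sequence with ProtBert (assumed setup).
    import re
    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
    encoder = BertModel.from_pretrained("Rostlab/prot_bert")

    def embed(sequence: str) -> torch.Tensor:
        # ProtBert treats each amino acid as a "word": residues are separated by
        # spaces and rare residues (U, Z, O, B) are mapped to X.
        spaced = " ".join(re.sub(r"[UZOB]", "X", sequence.upper()))
        inputs = tokenizer(spaced, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 1024)
        # Mean pooling over tokens is one simple choice for a fixed-length vector.
        return hidden.mean(dim=1).squeeze(0)

    # Hypothetical usage: concatenate peptide and CDR3 embeddings as input
    # features for a downstream binding classifier (illustrative sequences).
    peptide_vec = embed("GILGFVFTL")
    cdr3_vec = embed("CASSIRSSYEQYF")
    features = torch.cat([peptide_vec, cdr3_vec])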

Parallel Abstract


Predicting the interaction of T cell receptors (TCRs) with peptide-major histocompatibility complexes (pMHC) remains challenging. The challenge involves three main issues: data accuracy, data scarcity, and problem complexity. One of the fundamental unanswered questions about neoantigens and antigens is why not all of them elicit T cell responses, even though the peptide may be presented on the MHC at the cell surface. Accurate and comprehensive characterization of the interactions between neoantigens/antigens and TCRs is critical for understanding cancer progression, prognosis, and the response to immunotherapy. On the other hand, many recent NLP studies have shown that protein sequences can be regarded as sentences and amino acids as words. In this regard, researchers can use natural language processing to extract biological information from protein sequence databases. Recently, several successful pre-trained protein language models have become publicly available. This study therefore developed a prediction model based on the protein language model ProtBert to predict the TCR binding specificity of neoantigens/antigens presented by major histocompatibility complex class I. The results demonstrated that using a protein language model can improve prediction accuracy on both problems: MHC-peptide binding and TCR-pMHC binding. Moreover, this study integrated ensemble learning to further improve the prediction accuracy. The ProtBert-based ensemble model is expected to facilitate immunogenomics studies related to TCR binding in the near future.
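
The abstract mentions ensemble learning without detailing the scheme. The sketch below shows one common choice, soft voting over independently trained binding classifiers (for example, ProtBert-based models trained with different random seeds or data splits); the models list, the scikit-learn-style predict_proba interface, and the averaging rule are assumptions for illustration, not the thesis method.

    # Minimal sketch of soft-voting ensembling (assumed scheme, not the thesis code).
    import numpy as np

    def ensemble_predict(models, features):
        """Average binding probabilities predicted by each ensemble member."""
        # `models` are hypothetical classifiers exposing a scikit-learn style
        # predict_proba(); `features` holds ProtBert-derived sequence embeddings.
        probs = np.stack([m.predict_proba(features)[:, 1] for m in models])
        return probs.mean(axis=0)  # mean probability of binding per TCR-pMHC pair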

Parallel Keywords

TCR, TCR-pMHC, MHC-I, peptide

References


Devlin, J., M.-W. Chang, K. Lee and K. Toutanova (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805.
Elnaggar, A., M. Heinzinger, C. Dallago, G. Rihawi, Y. Wang, L. Jones, T. Gibbs, T. Feher, C. Angerer, M. Steinegger, D. Bhowmik and B. Rost (2020). "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing." arXiv:2007.06225.
Henikoff, S. and J. G. Henikoff (1992). "Amino acid substitution matrices from protein blocks." Proc Natl Acad Sci U S A 89(22): 10915-10919.
Krogsgaard, M. and M. M. Davis (2005). "How T cells 'see' antigen." Nature Immunology 6(3): 239-245.
Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma and R. Soricut (2019). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." arXiv:1909.11942.
