Integrating RNA Sequencing and Clinical Data with Multi-Task Learning for Multi-Cancer Prognosis Prediction

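Cancer is one of the leading causes of death worldwide. With accurate cancer prognosis prediction, high-risk patients can be identified and given the most suitable treatment, improving their survival. Today's medical data, however, are complex, large, and heterogeneous; deep learning outperforms statistical and traditional machine learning algorithms in several respects, which makes it possible to handle such data. In addition, deep learning lets us apply multi-task learning and multimodal learning so that a model learns knowledge shared across different cancers and uses it to make accurate prognosis predictions. As a case study, we used three datasets (breast, lung, and colorectal cancer) from The Cancer Genome Atlas (TCGA) project to implement a multi-task bimodal neural network for cancer prognosis prediction that integrates RNA sequencing (RNA-Seq) and clinical data. We also built a reusable Python code base that covers downloading and processing data from the TCGA project database, RNA-Seq data preprocessing, and a development framework for deep learning models. Experimental results show that the proposed multi-task bimodal neural network substantially improves the concordance index (c-index) and the area under the precision-recall curve (AUPRC), by 26% and 41%, respectively, a new step for this research direction. We believe that this line of research can, through deep learning, uncover latent relationships among different cancers and lay a further foundation for precision medicine.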

Keywords

multi-task learning; multimodal learning; deep learning; bioinformatics; feature selection

Parallel Abstract

Cancer is one of the leading causes of death worldwide. With accurate cancer prognosis predictions, high-risk patients can be identified and given appropriate treatment to increase their chance of survival. In this study, we integrate medical data from multiple cancer types and use multi-task learning to exploit the knowledge shared among them. As a case study, we implemented a multi-task bimodal neural network, which handles both RNA-Seq and clinical data, for cancer prognosis prediction on three datasets (breast, lung, and colon cancer) obtained from the TCGA project. We also developed a reusable Python code base that covers requesting data from the TCGA project database, data preprocessing, and development pipelines for deep learning models. Experimental results showed significant improvements of up to 26% and 41% in the c-index and AUPRC, respectively. Our research marks an initial step toward employing multi-task learning for prognosis prediction across different cancer types.
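For readers unfamiliar with the c-index reported above, the following is a minimal sketch of Harrell's concordance index for right-censored survival data. It is not taken from the thesis code base; the function name `concordance_index` and the toy patient data are hypothetical and serve only as illustration.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's c-index for right-censored survival data (illustrative sketch).

    times  : observed follow-up time per patient
    events : 1 if death was observed, 0 if the patient was censored
    risks  : predicted risk score (higher means worse prognosis)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that patient `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the two times differ and the
        # earlier time corresponds to an observed event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # model ranks the earlier death as higher risk
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: four patients, the third one censored.
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risks = [0.9, 0.7, 0.4, 0.1]
print(concordance_index(times, events, risks))  # 1.0
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking of patients by risk, which is why relative gains in the c-index are a natural way to compare prognosis models.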

Parallel Keywords

multi-task learning; multimodal learning; deep learning; bioinformatics; feature selection
