System-on-Chip Design and Implementation for Next-Generation Sequencing Data Analysis

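DNA sequencing is in great demand in the biological sciences and in medical treatment, and next-generation sequencing (NGS) achieves high-speed sequencing by breaking a long DNA sequence into short fragments, or short reads. DNA data processing reconstructs the short reads obtained from sequencing and then uses the reconstructed DNA sequence to find variant genes in the organism and their relative positions. It consists of two main procedures: short-read mapping and variant calling. Short-read mapping maps a large number of sequenced short reads onto a reference sequence, while variant calling uses dedicated algorithms to find, among the mapped short reads, possible variants that differ from the reference sequence. Because NGS produces an enormous amount of data, the subsequent data processing is very time consuming. This thesis proposes the first hardware acceleration solution that implements both short-read mapping and variant calling. The diagonal processing array it adopts doubles the alignment throughput, and effective scheduling halves the time spent on computation and memory access. The Smith-Waterman algorithm and the Viterbi decoding algorithm are implemented on a shared hardware architecture, which reduces the area complexity to 20% of the original. Taped out in TSMC 28 nm technology, the chip can map 40 million short reads of length 100 in 11 minutes and complete variant calling for a complete genome sequence in 19 minutes. Compared with previous FPGA implementations, it improves energy efficiency and throughput-to-area ratio by 149x and 165x, respectively.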

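Keywords

next-generation sequencing; short read; sequence alignment; sequence mapping; variant calling; Smith-Waterman; hidden Markov model; sequence assembly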

Parallel Abstract

A strong need exists to identify the DNA sequences of various species for scientific discovery and medical diagnosis. Next-generation sequencing (NGS) supports high-speed sequencing by partitioning a long DNA sequence into smaller pieces, known as short reads. The short reads are then assembled, and mutations (variants) are detected by DNA data analysis. DNA data analysis has two phases: short-read mapping and variant calling. Short-read mapping maps a large number of short reads onto a reference sequence, and variant calling determines where the variants occur. Since the human DNA sequence contains about three billion nucleotides, the associated data analysis is very time consuming. This work presents the first ASIC to accelerate computations for both short-read mapping and variant calling. Both local and global alignments are supported with proper configuration. A diagonal systolic array is used to double the throughput rate. Computation and memory access are scheduled efficiently to reduce latency by half. The essential Smith-Waterman algorithm and the Viterbi decoding algorithm are supported efficiently using only 20% of the area of the direct-mapped architecture. Designed in 28 nm CMOS technology, the chip can finish short-read mapping of 40M 100-bp reads in 11 minutes and variant calling in 19 minutes. The proposed ASIC improves energy efficiency and throughput-to-area ratio by 149x and 165x, respectively, over the current state-of-the-art FPGA solution.
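As a rough software illustration of why a diagonal array suits Smith-Waterman, the sketch below fills the local-alignment score matrix one anti-diagonal at a time. Every cell on an anti-diagonal depends only on the two preceding anti-diagonals, so the cells of one diagonal could in principle be computed in parallel by an array of processing elements. This is a minimal sketch and not the chip's architecture: the function name, the linear gap model, and the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration and do not come from the thesis.

```python
def smith_waterman_antidiagonal(query, ref, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score (linear gap penalty).

    The matrix is filled by anti-diagonals (cells with equal i + j).
    Scoring parameters are illustrative assumptions, not thesis values.
    """
    m, n = len(query), len(ref)
    # H[i][j] = best score of a local alignment ending at query[i-1], ref[j-1]
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    # Anti-diagonal d holds every cell with i + j == d. Those cells read only
    # from diagonals d-1 and d-2, so the inner loop has no internal dependency.
    for d in range(2, m + n + 1):
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            H[i][j] = max(
                0,                      # restart: local alignment
                H[i - 1][j - 1] + sub,  # match or mismatch
                H[i - 1][j] + gap,      # gap in the reference
                H[i][j - 1] + gap,      # gap in the query
            )
            best = max(best, H[i][j])
    return best


print(smith_waterman_antidiagonal("ACGT", "ACGT"))  # 8
```

In this software version the inner loop still runs sequentially; the point is only that its iterations are independent, which is the property a hardware array can exploit.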

Parallel Keywords

NGS; short read; alignment; mapping; variant calling; Smith-Waterman; Pair HMM; de Bruijn graph
