Parallel Architecture and Hardware Implementation of a Pre-Filter and Post-Processor for Sequence Assembly

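In biological sequence assembly, the front-end sensors misidentify fluorescent labels with some probability, so reads must be preprocessed before assembly. Preprocessing feeds reads in 2-base color code into a filter that removes low-quality reads, which reduces assembly errors, and converts the remaining reads from color code into pseudo bases that general assembly programs can accept. After assembly, post-processing restores the assembled pseudo-base sequence to nucleotide bases. Because the data volume is large, one pass through the flow takes a long time, of which pre-filtering, assembly, and post-processing each account for about one third. In this thesis, we propose a pre-filter and post-processor hardware architecture for sequence assembly with the Velvet software. Because reads do not affect one another during pre-filtering, the design exploits hardware parallelism and pipelined filter-condition circuits to shorten computation time substantially; moreover, under the pre-filter's scheduling scheme, computation time does not change even when read length increases. The pre-filter is also scalable: the design can be adjusted to the read length, and further speed-up can be obtained by increasing the degree of parallelism. The post-processor converts the assembler's output from pseudo bases to nucleotide bases. We propose using only the reads at the front of the overall sequence to assist in converting the Velvet-assembled pseudo bases, which greatly reduces computation time, and pipelining further reduces the time the circuit needs. The hardware is implemented in a TSMC 90 nm process and handles a maximum read length of 63 and up to 256M reads. The circuit is split into two chips with areas of 2367051 μm² and 774724 μm². At a design operating frequency of 100 MHz, the pre-filter and post-processor are more than 11000 times and 34000 times faster than software, respectively.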

Keywords

Biological sequence assembly; Velvet; sequence filtering; hardware architecture implementation

Parallel Abstract

In DNA sequence assembly, assembly tools align and merge high-throughput DNA reads to reconstruct the original DNA sequence. The general assembly flow has three steps. First, the pre-processor filters out low-quality reads to decrease both the error rate and the problem size, and converts the remaining reads from 2-base color code to pseudo bases that are compatible with general assembly tools. Second, the reads are assembled. Third, the post-processor converts the pseudo-base results back to genome sequences. Since each of the three steps takes about 1/3 of the total computation time, we first set out to improve the performance of the pre-filter and post-processor. In this study, we propose a pre-filter and post-processor hardware design for the assembly tool Velvet. In the pre-filter, reads are filtered efficiently thanks to the hardware advantages of parallel processing and pipelining. The execution time is linearly proportional to the number of reads but independent of the read length. In addition, the architecture is scalable: it can be sped up further by increasing the number of parallel computational units. As for the post-processor, we propose a new approach that converts pseudo bases back to real bases using only the first several short reads of the whole sequence together with the Velvet results; it is also implemented in a pipelined manner. The chips are implemented in TSMC 90 nm technology. The design can process 256 M reads with a maximum read length of 63. The two chip areas are 2367051 μm² and 774724 μm², and the chips operate at 100 MHz. Compared with software approaches, the speed-ups of the pre-filter and post-processor are as high as 11000 times and 34000 times, respectively.
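The conversions between real bases, 2-base color code, and pseudo bases can be illustrated with a minimal software sketch. It assumes the conventional color-space scheme, in which A, C, G, T are numbered 0 to 3, each color is the XOR of two adjacent base codes, and pseudo bases simply relabel colors 0 to 3 as A, C, G, T. The abstract does not state the exact mapping or the circuit's method, so the function names and the mapping here are illustrative assumptions, not the hardware design.

```python
# Assumed convention (not confirmed by the abstract): A=0, C=1, G=2, T=3,
# color = XOR of the codes of two adjacent bases.
BASES = "ACGT"


def bases_to_colors(seq):
    """Encode a base sequence as a list of colors, one per adjacent base pair."""
    codes = [BASES.index(b) for b in seq]
    return [a ^ b for a, b in zip(codes, codes[1:])]


def colors_to_pseudo(colors):
    """Relabel colors 0..3 as the pseudo bases A, C, G, T."""
    return "".join(BASES[c] for c in colors)


def pseudo_to_bases(first_base, pseudo):
    """Recover real bases from pseudo bases, given one known leading base."""
    code = BASES.index(first_base)
    out = [first_base]
    for p in pseudo:
        code ^= BASES.index(p)  # next base code = previous code XOR color
        out.append(BASES[code])
    return "".join(out)


colors = bases_to_colors("ATGGCA")
pseudo = colors_to_pseudo(colors)
restored = pseudo_to_bases("A", pseudo)
```

Decoding needs a known starting base, which is one reason a post-processing step would consult real reads at the front of the sequence. A single wrong color also changes every base decoded after it, which is consistent with filtering low-quality reads before assembly.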

Parallel Keywords

Sequence Assembly; Velvet; DNA read filtering; hardware implementation
