透過您的圖書館登入
IP:18.221.9.10
  • 學位論文

全新基因序列組裝平行演算法與硬體架構設計

A Parallel Algorithm and its Hardware Architecture Design for De Novo Sequence Assembly

指導教授 : 盧奕璋

摘要


全新基因序列組裝的結果在基因定序計畫中扮演著一個非常重要的角色,由於定序資料非常龐大,因此其過程必須耗費大量的時間。如何去減少運算所需的時間一直都是個很重要的議題。全新基因序列組裝的流程可再細分為前置過濾以及序列組裝兩部份。前置過濾主要負責將品質較差的序列根據不同的過濾條件濾除,來達到較好的組裝結果;序列組裝則是負責將過濾完的短序列組裝成長序列,使其更接近原有的基因序列。 在本篇論文中,我們針對兩部份各提出了一個平行化的架構。在前置過濾的部分,由於其流程中,各序列的過濾結果互不影響,因此透過硬體平行處理以及管線化架構可以使得過濾所需時間大幅縮短。在序列組裝演算法部分,基於尤拉路徑演算法的概念,我們將其建構de Bruijn圖的過程改進,使得原本比較適合單一執行緒的演算法更適合平行架構,並且保有與原架構相近的輸出解品質。此外,不論是前置過濾或是序列組裝演算法的部分,我們都可以調整其平行化的程度,來符合使用者所需的加速倍率以及硬體限制。 在前置過濾器部份,我們透過了Terasic DE4-230可程式化邏輯閘陣列(Field Programmable Gate Array, FPGA)來實作。相較於傳統透過單一執行緒的電腦處理,我們的架構可以縮短54.2 %的過濾時間。而序列組裝演算法部份則是透過訊息傳遞介面(Message Passing Interface, MPI)將其實做,運算所需時間相較於傳統演算法則縮短了22~35 %。

並列摘要


De novo sequence assembly plays an important role in genome projects. How to reduce the long processing time caused by the large amount of genome data becomes an important issue. The process of de novo sequence assembly can be divided into two parts: the pre-processing and the assembly stages. In the pre-processing stage, the program filters out the low quality reads, while in the assembly stage, the program assembles short reads into long contigs. In this thesis, we propose parallel architectures for both parts. In the pre-processing, the operations of each read is independent, so we can easily use pipeline and parallel techniques of hardware design to accelerate the process. In the assembly stage, based on Eulerian path algorithm, we modify the construction of de Bruijn graph to make it more suitable for parallel architecture. We can adjust the degree of parallelism for different hardware specifications. We use a Terasic DE4-230 FPGA board as the platform to implement our pre-processing hardware design. The FPGA-PC system saves 54.2% of execution time required by the PC-only design of for the pre-processing stage. The parallel algorithm is implemented by Message Passing Interface (MPI), the proposed algorithm can save 22~35% processing time if 65 CPU cores are used.

參考文獻


[1] F. Sanger, S. Nicklen, and A. R. Coulson, "DNA sequencing with chain-terminating inhibitors," Proceedings of the National Academy of Sciences, vol. 74, pp. 5463-5467, December 1, 1977 1977.
[2] T. Finsterbusch and A. Mankertz, "Porcine circoviruses—Small but powerful," Virus Research, vol. 143, pp. 177-183, 2009.
[3] A. Nakabachi, A. Yamashita, H. Toh, H. Ishikawa, H. E. Dunbar, N. A. Moran, et al., "The 160-Kilobase Genome of the Bacterial Endosymbiont Carsonella," Science, vol. 314, p. 267, October 13 2006.
[4] Y.-L. Huang, "Architecture design of the parallel processing element and interconnection network for de novo sequence assembly," M.S. Thesis, Graudate Institute of Electronics Engineering, National Taiwan University Taiwan, 2012.
[5] D. R. Zerbino and E. Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs," Genome Research, vol. 18, pp. 821-829, May 1 2008.

延伸閱讀