以短基因片段重組基因組之演算法設計與硬體實作

生物的行為與特徵由基因組主導。基因組即為細胞核內的去氧核醣核酸(deoxyribonucleic acid, DNA)序列。DNA序列由四種鹼基組成-A、C、G和T，可視為由四種字母組成的長字串。藉由一生物個體的DNA序列的字母組成，可分析其可能帶有的遺傳疾病。人類DNA序列理所當然地受到最大的關注。 DNA定序儀(DNA sequencing machine)可讀出DNA序列的字母組成。然而現存的DNA定序儀對於輸入的DNA序列有長度限制(約20~1000個鹼基)。人類DNA序列長度約為3×〖10〗^9個鹼基，無法以DNA定序儀直接讀取。因此，要讀出人體DNA序列的字母組成，需從人體取出多份DNA序列，隨機地切成1000個鹼基以下的片段給DNA定序儀讀取。這些片段稱為短基因片段。讀出短基因片段的字母組成後，再用這些短基因片段重建出原本的DNA序列。不同人之間的DNA序列極為相似，因此可參考另一份已重建完成的人類DNA序列(參考序列)，找到各短基因片段在參考序列上與其最相似的片段，以及其所在位置(稱為最佳對應位置)。將各短基因片段放置在最佳對應位置，即可拼湊出原本的 DNA序列。為短基因片段找到參考序列上的最相似片段，稱為短基因片段排列(short read alignment)。現存的短基因片段排列演算法都以軟體實作為主。當基因組重建成為醫療院所之間普及的醫療項目，軟體的速度勢必無法因應龐大的需求。短基因片段排列包含候選對應位置產生器和相似度分數與確切排列計算兩部分。現存演算法的候選對應位置產生器在硬體實作時有記憶體使用量過大的問題。本論文提出以布隆過濾器為基礎的候選對應位置產生器，減少記憶體使用量。在相似度分數計算與確切排列方面，現存演算法使用史密斯-瓦特曼演算法(Smith-Waterman algorithm)，並以平行陣列(systolic array)進行硬體實作。本論文改進了現存作法的平行陣列中的處理單元，使得處理單元有更小的電路面積與更短的關鍵路徑。

關鍵字

基因組重建；短基因片段排列；布隆過濾器；候選對應位置；史密斯-瓦特曼演算法

並列摘要

Genome governs the behavior and traits of a being. It is the DNA sequence in cell nuclei. DNA sequences consist of four types of nucleobases - A, C, G and T. A DNA sequence can be regarded as a long string constructed by these four alphabets. By the DNA sequences of an individual, we can analyze potential hereditary diseases it may have. Human DNA sequence is surely the most popular topic. DNA sequencing machine can decode the alphabet on a DNA sequence. However, contemporary DNA sequencing machines have a length limit for incoming DNA sequences(about 20~1000 nucleobases). The total length of human DNA sequences is about 3×〖10〗^9 nucleobases, which can not be directly decoded by DNA sequencing machines. So, to decode the alphabet of the DNA sequence of a human, we need to extract multiple copies of the DNA sequences from him/her, and then randomly cut the sequences into small pieces with length under 1000 nucleobases for DNA sequencing machines to decode. These small pieces are called short reads. After the alphabet on short reads is decoded, we reconstruct the original DNA sequence by these short reads. Since the DNA sequences between two different human are very similar, we can refer to a reconstructed DNA sequence of another human(which is called reference sequence). For each short read, find the most similar subsequence and the location of the most similar subsequence (which is called best mapping location) on the reference sequence. The original DNA sequence can be reconstructed by putting each short read to the best mapping location. The process of finding the most similar subsequence on the reference sequence is called short read alignment. Currently, short read alignment algorithms are mostly done in software. After genome reassembly become a popular medical service in hospitals, the speed of software is not enough for huge amount of needs. Short read alignment contains candidate mapping location(CML) generator and similarity score with actual alignment computation. State-of-the-art CML generators consumes too much memory when implemented in hardware. We propose Bloom filter-based CML generator, which has much lower memory consumption than state-of-the-art solutions. Similarity score computation and actual alignment computation are usually done with Smith-Waterman algorithm, which is implemented in systolic array for hardware implementations. We propose a modified processing element in the systolic array to have a smaller area and shorter critical path.

並列關鍵字

Genome reassembly ； short read alignment ； Bloom filter ； candidate mapping location ； Smith-Waterman algorithm

參考文獻

[3] Wikipedia – Human Genome Project

[4] Wikipedia – Euchromatin

[5] B. Langmead & S. Salzberg, “Fast gapped-read alignment with Bowtie 2”, Nature Methods 9, 357–359 (2012) doi:10.1038/nmeth.1923

[7] Wikipedia – Indel

[8] Wikipedia – Point Mutation

國際替代計量

以短基因片段重組基因組之演算法設計與硬體實作

全文下載

主題瀏覽