  • Thesis

使用核苷酸位置壓縮演算法之基因變異檢定方法設計及其硬體分析器實現

Design of a Variant Calling Method and its Hardware Analyzer with Nucleotide-Position-Based Data Compression Algorithm

Advisor: 盧奕璋

Abstract


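This thesis proposes vBAM, a next-generation sequencing data file format organized by nucleotide position, together with its reference software system and a hardware accelerator for variant analysis. A vBAM file consists of two sub-files, vRead and rInfo: vRead stores the information associated with individual bases, while rInfo records the information shared by an entire short read. By removing redundant information that variant calling does not use and compressing the base information at each position with run-length coding, the vBAM format allows variant calling to run faster, and its file size shrinks to about 20% of the corresponding BAM file. The vBAM reference software system is written in C++ and supports the BAM-to-vBAM encoding flow, the vBAM decoding flow, and a variant calling flow that uses vBAM. Its encoding time is slightly longer than the time SAMtools takes to convert BAM to pileup, while its decoding and variant calling are about 4 times as fast as VarScan. The hardware accelerator speeds up decoding and variant calling; the chip uses a TSMC 40 nm process, occupies 2.25 mm^2, and runs at a 250 MHz clock frequency. It uses lower-precision variant confidence information but supports most variant calling functions. Compared with the software, vBAM decoding and variant calling with the accelerator gain roughly a further 8x speedup.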

Parallel Abstract


In this thesis, we propose vBAM, a new nucleotide-position-based file format for next-generation sequencing data, and implement a reference software system that covers encoding, decoding, and variant calling, as well as a hardware accelerator for decoding and variant calling. The vBAM format contains two sub-files, vRead and rInfo: the vRead file stores per-position data such as bases and base qualities, and the rInfo file stores whole-read data such as read lengths and mapping qualities. The vBAM format removes redundant data that variant calling does not require and uses run-length coding to compress nucleotide bases and base qualities. As a result, a vBAM file is more efficient for variant calling and needs only about 20% of the size of the corresponding BAM file. The vBAM reference software, vBAM System, is written in C++ and supports BAM-to-vBAM conversion, vBAM decoding, and variant calling. vBAM encoding is only slightly slower than converting BAM to pileup with SAMtools, while decoding and variant calling are about 4X faster than VarScan. The hardware accelerator is implemented in TSMC 40 nm technology with a chip area of 2.25 mm^2 and runs at a 250 MHz clock frequency. It supports most variant calling functions, with a minor sacrifice in significance precision. Compared with the software version, the accelerator performs vBAM decoding and variant calling about 8X faster.
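
To illustrate why run-length coding suits position-ordered data, the following minimal Python sketch encodes the bases that cover a single reference position. It is not the vBAM encoding itself: the abstract does not describe the actual vRead layout, so the function names `rle_encode` and `rle_decode`, the (symbol, run length) pair representation, and the sample column are all assumptions made for illustration.

```python
from itertools import groupby


def rle_encode(symbols):
    """Collapse consecutive repeats into (symbol, run_length) pairs.

    Hypothetical helper: the real vRead layout is not given in the abstract.
    """
    return [(sym, len(list(run))) for sym, run in groupby(symbols)]


def rle_decode(pairs):
    """Expand (symbol, run_length) pairs back into a flat list of symbols."""
    return [sym for sym, count in pairs for _ in range(count)]


# Hypothetical pileup column: the bases reported by the reads that cover
# one reference position. Most reads agree, so the runs are long.
column_bases = "AAAAAAGAAAA"

encoded = rle_encode(column_bases)
# Round trip: decoding must reproduce the original column exactly.
assert rle_decode(encoded) == list(column_bases)

print(encoded)
```

Because most reads at a given position carry the same base, a column like this collapses into a few pairs. The same helpers accept a list of integer base qualities, although quality values usually vary more and would be expected to compress less.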

