利用生物資訊演算法最佳化癌症基因定序：癌症基因重組與癌症基因比對

癌症是世界上最常見的死亡原因之一，而DNA損傷及DNA突變則是形成癌症的根本原因。DNA損傷指的是DNA物理上的異常，如單、雙螺旋鏈斷裂；而DNA突變則代表DNA鹼基之改變。對於癌症患者而言，通過早期檢測能夠提高治癒的機率，同時進行切除手術，痊癒不無可能。癌症基因定序，包含了癌症基因重組與癌症基因比對，能夠檢測出癌症DNA之突變並及早發現罹癌可能性。癌症基因重組需要大量的時間及空間處理、分析數以百萬的可行解。因此癌症基因重組問題已被認定為NP問題。而癌症基因比對需要高度且密集的計算，因為大又長的序列對記憶體及速度的要求，使得實作演算法更具有挑戰性，從形式上來看，癌症基因比對也是NP問題。癌症基因定序中的癌症基因比對步驟在癌症檢測技術中是一項重要的環節。已經有幾篇文章提出快速的比對技術，如特徵正交分解(簡稱POD)技術和基於Smith – Waterman 的啟發式演算法等等。本論文提出利用生物資訊模組建構癌症基因重組及比對演算法檢測及快篩癌症，以增加癌症檢測之準確率。本論文所提出之癌症基因定序方法，在癌症基因重組部分，首先將輸入之DNA切割成小片段。這些片段將用來建造De Bruijn 圖形，每個片段分別代表De Bruijn 圖形的節點。每個片段在切割時，會同時標記上片段的順序。同時，使用變形的Euler路徑方法，在De Bruijn上找出一條最佳匹配路徑以解決癌症基因重組問題。尤拉路徑首先會選擇一個片段當作節點，並且經過每個邊所限制的次數。如頂點上有分支點，則由拉路徑將會同時走兩條路徑。而在癌症基因比對方面，一個經過生物資訊最佳化的Smith – Waterman 演算法，利用生物資訊方法，將兩段重組過後的DNA片段建立成一組生物計算Smith–Waterman 矩陣可以非常迅速的解決癌症基因比對問題。最佳化之癌症基因定序方法，充分的利用平行計算克服了時間複雜度的瓶頸，並且改進癌症DNA序列重組與比對，使之效率更甚。本論文實驗結果證得，癌症基因重組所耗費之時間複雜度為O(n^3)，而癌症基因比對所耗費之時間複雜度為O(3n-1)。在未來，將發展更多癌症基因定序相關演算法，在基因工程領域追求更佳之擴展性及準確率。或許考慮在不久的將來，應用更多先進的方法使得我們所提出的演算法得到更快速的癌症相關問題之解決方案，或發展新的技術，如使用特徵正交分解技術(POD)。

關鍵字

最佳化；平行計算；癌症基因定序；癌症基因重組；癌症基因比對； NP-困難；生物資訊計算； De Bruijn 圖形； Euler 路徑； Smith–Waterman 演算法；特徵正交分解

並列摘要

Cancer is one of the most common causes of death worldwide. The underlying causes of cancer are DNA damage and DNA mutation. DNA damage is physical abnormalities in the DNA, such as single and double-strand breaks; and DNA mutation represent a change in the nucleobases of the DNA. The best opportunity for improving survival of cancer patients is through early detection, when curative surgical resection is possible. Through cancer genome sequencing includes cancer genome assembly and cancer genome alignment can detect the mutation of cancer DNA for early detection of cancer. Cancer genome assembly needed large space and time to generate and analysis millions of possible solutions. Hence, cancer genome assembly problem has been recognized as NP-hard. Cancer genome alignment is highly computationally intensive. The implementation of the algorithm has become more challenging as the memory requirements and speed, due to the large long sequence. Formally, the problem is also NP. The cancer genome alignment step in the cancer genome sequencing is an import part of cancer detection. There are some papers proposed fast alignment algorithms, such as proper orthogonal decomposition (POD) technology and heuristic approach based on Smith – Waterman algorithm etc. This thesis proposes a newly developed bio-informatics model to construct cancer genome sequencing algorithm for cancer detection and rapid screening for increasing the detection accuracy with cancer. The cancer genome sequencing approach of this thesis, in the cancer genome assembly, first cut input DNA into several small fragments. These fragments are used to construct De Bruijn graph, each fragments represent a vertex of De Bruijn graph. Fragments will mark an order number when these fragments are cut. Then we use a modified Euler path to find an optimal path on the constructed De Bruijn graph for solving cancer genome assembly problem. Euler path first choose a fragment as root, then go through each edges for a limited times. If a vertex has branches, Euler path will go every branch simultaneously. In the alignment part, an optimized bioinformatics Smith–Waterman algorithm constructs a bioinformatics Smith–Waterman matrix by two reassembly DNAs and traces back with the highest score for aligning two DNAs. The optimized cancer genome sequencing approach is fully utilizing parallelism to conquer time complexity bottleneck, and improves any cancer DNA sequence assembly and alignment more efficient. The experimental results of cancer genome assembly can be found in O(n^3) polynomial bound. And the experimental results of cancer genome alignment is also in O(3n-1) polynomial bound. In the future, we will develop more cancer genome sequencing related algorithms for pursuing better scalability and accuracy in the genetic engineering areas. Perhaps consider the possibility of using more advanced methods for cancer problem quick resolution in the near future or developing newly approaches such as POD technology.

並列關鍵字

Optimization ； Parallel Computing ； Cancer Genome Sequencing ； Cancer Genome assembly ； Cancer Genome Alignment ； NP-hard ； Bioinformatics Computing ； De Bruijn Graphs ； Euler Path ； Smith–Waterman Algorithm ； Proper Orthogonal Decomposition

參考文獻

[1] World Health Organization, “Cancer,” Fact sheet N°297 January 2013.

[3] M. Venturi, R. J. Hambly, B. Glinghammar, J. J. Rafter, I. R. Rowland, “Genotoxic activity in human faecal water and the role of bile acids: a study using the alkaline comet assay,” Carcinogenesis, 18(12):2353-9, December 1997.

[4] S. Hakomori, “Cancer Res,” 56, 5309-5318, 1996.

[5] D. Peterson, C. H. Lee “A DNA-based Pattern Recognition Technique for Cancer Detection,” Annual International Conference, IEEE, September, 2004.

[7] C. Burks “DNA sequence assembly,” Engineering in Medicine and Biology Magazine, IEEE, 13(5), 771-773. November/December 1994.

國際替代計量

利用生物資訊演算法最佳化癌症基因定序：癌症基因重組與癌症基因比對

未授權

主題瀏覽