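In molecular evolution, phylogenetic relationships can be inferred by comparing the differences between species, which in turn helps explain the evolutionary process. This evolutionary history is usually well represented by a tree structure, called a phylogenetic tree or evolutionary tree. Phylogenetic analysis can make protein sequence alignment more meaningful in evolutionary terms, but constructing a phylogenetic tree requires considerable execution time as the number of sequences grows. In recent years, cloud computing has drawn the attention of many large organizations and may become a core technology for distributed computing with a wide range of applications; the aim of this study is therefore to apply cloud computing in practice to the field of bioinformatics. The sequence data used to construct the phylogenetic trees are Homo protein sequences established by the Swiss Institute of Bioinformatics. After preprocessing, the tree-construction computations are run in a cloud computing environment, and the construction time required under different conditions is analyzed. Distances between sequences are computed with the method proposed by Jukes and Cantor, and the trees are built with the NJ and UPGMA algorithms. The cloud computing environment is a Hadoop architecture, consisting of the parallel programming tools and distributed file system developed by the Apache Software Foundation, running on the Linux operating system of an Ubuntu Live CD. The experimental results show that most of the time spent constructing a phylogenetic tree goes to computing the distance matrix. When the distance matrix is computed in the Hadoop distributed environment, the number of map tasks is the main factor affecting computation time, whereas the number of reduce tasks has little effect on the distance-matrix computation time in this study. In this study, when the number of sequences is 40 or more, the Hadoop distributed environment is more cost-effective than a single-machine environment.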
Phylogenetic trees are tree structures that display the course of evolution and the phylogenetic relationships among species. The computational cost of constructing a phylogenetic tree is high when many sequences are used. Cloud computing is widely applied in many fields and has attracted the attention of many organizations; this study therefore employs cloud computing techniques to explore the NP-complete problems in bioinformatics. In this study, Homo protein sequences established by the Swiss Institute of Bioinformatics are used to construct phylogenetic trees. After data preprocessing, the tasks of constructing the phylogenetic trees are mapped onto the cloud computing environment and their execution times are analyzed. The distances between sequences are computed with the Jukes and Cantor algorithm, and the phylogenetic trees are constructed with NJ and UPGMA. The cloud computing environment is established rapidly by using the Linux operating system of an Ubuntu Live CD together with Hadoop, the parallel programming tools and distributed file system developed by the Apache Software Foundation. The results indicate that calculating the distance matrix accounts for the major proportion of the computation time when constructing the phylogenetic trees. In the Map/Reduce computation of the distance matrix, the number of Map tasks is the main factor affecting computation time, whereas the number of Reduce tasks has little effect. Finally, the Hadoop distributed computing environment on multiple personal computers is more cost-effective than a single personal computer when the number of sequences is 40 or more.
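
As an illustration of the distance-matrix step described above, the following is a minimal single-process sketch in Python, not the Hadoop implementation used in the study. It assumes pre-aligned sequences of equal length, skips gap columns marked `-`, and uses the generalized Jukes-Cantor correction with a configurable number of character states (20 is assumed here for amino acids; 4 gives the classical nucleotide form). The names `map_task`, `reduce_task`, and `num_map_tasks` are hypothetical and only mimic how sequence pairs might be split among map tasks and merged in a reduce step.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: '-' marks a gap; columns with a gap are ignored.
    sites = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not sites:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in sites) / len(sites)


def jukes_cantor(a, b, states=20):
    """Generalized Jukes-Cantor distance: -k * ln(1 - p / k), k = (states - 1) / states."""
    p = p_distance(a, b)
    if p == 0:
        return 0.0
    k = (states - 1) / states
    if p >= k:
        return math.inf  # sequences are saturated; the correction is undefined
    return -k * math.log(1 - p / k)


def map_task(pair_chunk, seqs, states=20):
    """Hypothetical map step: emit ((i, j), distance) for each assigned pair."""
    return [((i, j), jukes_cantor(seqs[i], seqs[j], states)) for i, j in pair_chunk]


def reduce_task(emitted, n):
    """Hypothetical reduce step: merge emitted pairs into a symmetric matrix."""
    matrix = [[0.0] * n for _ in range(n)]
    for (i, j), d in emitted:
        matrix[i][j] = matrix[j][i] = d
    return matrix


def distance_matrix(seqs, num_map_tasks=4, states=20):
    """Split all sequence pairs among map tasks, then reduce to one matrix."""
    pairs = list(combinations(range(len(seqs)), 2))
    # Round-robin split so each map task receives a similar number of pairs.
    chunks = [pairs[k::num_map_tasks] for k in range(num_map_tasks)]
    emitted = [kv for chunk in chunks for kv in map_task(chunk, seqs, states)]
    return reduce_task(emitted, len(seqs))


if __name__ == "__main__":
    toy = ["ACDEFGHIKL", "ACDEFGHIKL", "ACDEFGHIKV"]  # made-up example sequences
    for row in distance_matrix(toy, num_map_tasks=2):
        print(row)
```

Because the number of pairs grows quadratically with the number of sequences, this step is the natural one to distribute, and the resulting matrix is what a tree-building method such as UPGMA or NJ would take as input.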