  • Thesis

基於雲端運算模式之巨量資訊分析

Big Data Analytic in the Cloud Computing Models

Advisor: 陳銘憲
The full text will be available for download on 2026/08/22.

Abstract


With the rapid growth of Internet applications and mobile devices, users generate massive volumes of data every day. Analyzing this big data can yield deeper insights, enrich the expertise of different domains, and advance those fields. However, traditional data analysis algorithms were designed with a single machine in mind. When processing big data, a single machine may not provide the hardware resources the analysis requires, such as memory capacity, disk space, and the hardware demanded by different computing models. All of these factors make designing big data analysis algorithms more difficult. We therefore investigate in depth the design issues faced by big data mining algorithms built on cloud computing.

This dissertation addresses three major challenges. First, data scalability: in a cloud computing environment, the input data must first be converted into a key/value representation, which is unnatural for traditional data mining algorithms and complicates the design of highly scalable big data mining algorithms. Second, load balancing: the input data is usually split into many data partitions, and the cloud platform assigns each partition to a different machine for analysis; however, the workloads carried by different machines may differ, which again complicates algorithm design. Third, constraints of data communication: today's cloud computing platforms organize computation into phases, such as the map phase and the reduce phase, and machines within the same phase cannot communicate with one another directly. These inherent limitations of the cloud computing environment and its computing model must be taken into account when designing distributed big data mining algorithms.

In this dissertation, we point out the challenges that traditional data mining algorithms face on cloud computing platforms and, for each challenge, propose a suitable cloud computing model together with a corresponding big data mining algorithm. We first propose two distributed data clustering algorithms based on the MapReduce model: a distributed density-based clustering algorithm and a distributed grid-based clustering algorithm for cloud computing environments. The proposed algorithms split the massive input data into partitions of nearly equal size and minimize the communication cost of distributed mining, so that very large datasets can be clustered efficiently.

In Chapter 3, we observe that under the MapReduce model, recursive mining algorithms run into the following situations: data is difficult to share within a single processing phase, and the MapReduce model cannot iterate over partial results, requiring extra phases to process them. To overcome this, we target one such algorithm, frequent sequential pattern mining, and propose the distributed Sequential PAttern Mining algorithm in the Cloud (SPAMC). We adopt an iterative MapReduce model to overcome the problems above and process massive sequence data efficiently.

Next, in Chapter 4, based on the experimental results of the previous chapter, we find that even with the iterative MapReduce model, the cost of repeatedly initializing MapReduce jobs in frequent sequential pattern mining remains too high. We therefore propose a new frequent sequential pattern mining algorithm that employs a Uniform Distributed Lexical sequence Tree (SPAMC-UDLT). It runs on a newly proposed streaming MapReduce model to solve the problems of data reloading, unbalanced workload, and the lack of an effective communication mechanism between mappers in the iterative MapReduce model.

In the final chapter, we discuss components of mining algorithms that may demand substantial computational resources. To meet this demand, we design a Highly Scalable Distributed Density-based Clustering algorithm (HiClus) that runs in a heterogeneous cloud computing environment; it exploits the advantages of heterogeneous clouds by using thousands of GPU threads to perform data analysis in parallel. The experimental results in this dissertation show that the proposed distributed big data mining algorithms and new computing models achieve very high data scalability and speed up mining on massive datasets.
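To make the key/value framing concrete, the following is a minimal, hypothetical sketch of grid-based clustering in MapReduce style, simulated locally in plain Python. It is not the dissertation's actual DGC implementation: the function names, the 2-D point data, and the `cell_size`/`min_pts` parameters are illustrative assumptions. The map step emits a (cell id, 1) pair per point; the reduce step sums counts per key; dense cells are those meeting a count threshold.

```python
from collections import defaultdict

def map_phase(points, cell_size):
    """Mapper analogue: emit a (cell_id, 1) pair for each point.

    The grid cell id serves as the key, which is the unnatural
    key/value framing the data-scalability challenge refers to.
    """
    for x, y in points:
        yield (int(x // cell_size), int(y // cell_size)), 1

def reduce_phase(pairs):
    """Reducer analogue: sum the values grouped by key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def dense_cells(points, cell_size=1.0, min_pts=3):
    """Cells whose point count reaches min_pts become cluster seeds."""
    counts = reduce_phase(map_phase(points, cell_size))
    return {cell for cell, n in counts.items() if n >= min_pts}

points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3),  # three points in cell (0, 0)
          (5.2, 5.1)]                           # lone outlier in cell (5, 5)
print(dense_cells(points))  # -> {(0, 0)}
```

In a real cloud deployment the mapper and reducer would run on different machines over data partitions, which is exactly where the load-balancing and communication constraints discussed above arise.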

Parallel Abstract (English)


The amount of user-generated data from Internet applications has grown vast, and big data analysis of this data can provide insights into domain knowledge. However, the traditional approach of analyzing data on a single machine may be inadequate for big data, given its high resource requirements. Hardware limitations such as memory space, disk space, and the computing model usually increase the difficulty of algorithm design in big data analysis. The challenges of mining big data in the cloud are threefold. (1) Data scalability: in cloud computing, mining algorithms have to be framed in terms of keys and values, which is not natural for data analysis; this makes designing a highly scalable algorithm much more complicated. (2) Load balancing: the input data is usually split into many data partitions, and each partition is then analyzed on a different machine in cloud computing; however, differing workloads on different machines add to the difficulty of algorithm design. (3) Constraints of data communication: machines cannot communicate directly with each other within the same phase (map phase or reduce phase), so the inherent limitations of the cloud computing model must be considered in distributed mining algorithm design.

In this dissertation, we point out the design challenges pertaining to the cloud and propose corresponding mining algorithms with appropriate computing models for mining large datasets. In Chapter 2, we first devise distributed clustering algorithms using the MapReduce model and propose two such algorithms, namely Distributed Density-based Clustering (DDC) and Distributed Grid-based Clustering (DGC), which split a large dataset into partitions of almost equal size and minimize the communication cost to cluster a large dataset efficiently. In Chapter 3, we found that recursive algorithms in the MapReduce model may suffer from the inability to share data within the same processing phase. Moreover, a MapReduce job cannot deal with intermediate sub-results on distributed machines within one MapReduce round. To overcome this problem, we propose a distributed frequent sequential pattern mining algorithm, the Sequential PAttern Mining algorithm in the Cloud (SPAMC), which adopts the iterative MapReduce model to deal with large amounts of data efficiently. Chapter 4 builds on the results of Chapter 3: even though the iterative MapReduce model can process frequent sequential pattern mining, the initialization cost of MapReduce jobs is still too high. In addition, further challenges are associated with the iterative MapReduce model, including data reloading, unbalanced workload, and lack of communication between mappers. To address these challenges, we propose a novel distributed frequent sequential pattern mining algorithm, Sequential Pattern Mining in the Cloud - Uniform Distributed Lexical sequence Tree (SPAMC-UDLT), with a streaming MapReduce model to mine frequent sequential patterns. Finally, in Chapter 5, we explore a few components of mining algorithms that may demand high computational power. To solve this problem, we design a distributed density-based clustering algorithm for heterogeneous infrastructure, namely Highly scalable Density-based Clustering with heterogeneous cloud (HiClus), which exploits thousands of GPU threads to perform mining tasks in parallel and takes advantage of heterogeneous cloud computing to handle data-intensive workloads. The experimental results indicate that our proposed distributed mining algorithms and computing models achieve extremely high scalability and speed up mining on big data.
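The iterative structure that makes sequential pattern mining awkward for plain MapReduce can be illustrated with a minimal, hypothetical level-wise sketch in plain Python. This is not SPAMC itself: the function names, the toy sequence database, and the candidate-extension strategy are illustrative assumptions. Each pass of the loop plays the role of one MapReduce round, counting candidate patterns ("map") and pooling the counts ("reduce"); the round-per-level structure is what makes repeated job initialization costly.

```python
from collections import Counter

def is_subsequence(pattern, sequence):
    """True if pattern's items appear in sequence in order."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` advances the iterator

def mine_sequences(db, min_support):
    """Level-wise frequent-sequence mining; one loop pass per 'round'."""
    items = sorted({i for seq in db for i in seq})
    frequent, level = [], [(i,) for i in items]  # length-1 candidates
    while level:
        # "map" step: count each candidate against every sequence;
        # "reduce" step: Counter pools what partitions would emit.
        counts = Counter()
        for pat in level:
            counts[pat] = sum(is_subsequence(pat, s) for s in db)
        survivors = [p for p in level if counts[p] >= min_support]
        frequent.extend(survivors)
        # Extend surviving prefixes by one item for the next round.
        level = [p + (i,) for p in survivors for i in items]
    return frequent

db = [("a", "b", "c"), ("a", "c"), ("a", "b")]
print(mine_sequences(db, min_support=2))
# -> [('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c')]
```

Each iteration here would cost a full MapReduce job launch in the iterative model, and the sequence database would be reloaded every round; avoiding that data reloading and the lack of mapper-to-mapper communication is what motivates the streaming model used by SPAMC-UDLT.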

