A Study on the Design and Performance of a Cloud System for Association Rule Mining Based on a Hadoop Cluster

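Because the scale and characteristics of data in many applications now exceed what traditional database systems can handle, new and efficient data processing techniques are needed for large-scale processing. This requires storage capacity for massive data and, equally, a way to process and exploit the data once it is stored. Big data usually refers to high-volume, high-velocity, and high-variety information that must pass through efficient processing to support analysis and application. Data mining is an important problem for big data. In particular, developing classification, association analysis, and prediction techniques for cloud application services on big data will be key to making those services more effective. This study proposes a MapReduce-based cloud architecture for massive data processing, built on the open-source Hadoop framework and integrating the open-source HDFS, HBase, and MapReduce components. On this architecture, parallelized association rule algorithms and sequential pattern analysis algorithms are implemented, with the aim of improving the performance of association rule and sequential pattern mining on massive data. Performance is measured on a Hadoop cluster. The factors that affect the performance of the association rule and sequential pattern algorithms include the experimental environment, algorithm characteristics, the number of cluster nodes, input data characteristics, total data volume, the total number of Map tasks, and the total number of Reduce tasks. Using two experimental datasets, this study tests system performance under various combinations of these parameters and analyzes the results in order to evaluate and design the association rule and sequential pattern analysis algorithms, experimental environment, and parameter settings best suited to the characteristics of travel website user access logs.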

Keywords

Big data; Data mining; Association rule analysis; Sequential pattern analysis; Hadoop; MapReduce; HBase; HDFS; Parallelization

Parallel Abstract

Because the size and characteristics of data in many applications now exceed the processing capacity of conventional database systems, the demand for new, efficient techniques for handling massive data is increasing. Hence, ways to store and process huge amounts of data are required. Big data usually refers to high-volume, high-velocity, and high-variety information, which needs an efficient process to facilitate its analysis and application. Furthermore, data mining is an important issue in big data processing. In particular, developing big data classification, association analysis, and forecasting techniques for application services will be key to making those services more effective. Therefore, in this thesis, a data-intensive cloud architecture is proposed that integrates HDFS, HBase, and MapReduce on the Hadoop platform. Parallel association rule and sequential pattern algorithms are implemented to enhance the performance of the data mining system. A PC cluster experimental environment is set up to evaluate the factors that affect the performance of the association rule and sequential pattern algorithms, including algorithm characteristics, the number of cluster nodes, input data characteristics, the amount of data, the total number of Reduce tasks, and the total number of Map tasks. In addition, two experimental datasets, one simulated and one a real user access log from a travel news website, are used to assess the performance and results of four association rule algorithms under various parameter settings.
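As a rough illustration of how itemset support counting can be split into map and reduce steps, the following is a minimal single-process sketch in plain Python. It is not the thesis's implementation and does not use the Hadoop API. The function names, the sample sessions, and the `min_support` threshold are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, k):
    """Emit (itemset, 1) for every k-item subset of one transaction."""
    items = sorted(set(transaction))
    for itemset in combinations(items, k):
        yield itemset, 1


def shuffle(pairs):
    """Group emitted values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(itemset, counts, min_support):
    """Sum the counts for one itemset and keep it only if it is frequent."""
    total = sum(counts)
    if total >= min_support:
        yield itemset, total


def frequent_itemsets(transactions, k, min_support):
    """Run map, shuffle, and reduce in sequence over all transactions."""
    emitted = (pair for t in transactions for pair in map_phase(t, k))
    result = {}
    for itemset, counts in shuffle(emitted).items():
        for key, total in reduce_phase(itemset, counts, min_support):
            result[key] = total
    return result


if __name__ == "__main__":
    # Hypothetical page-visit sessions, not data from the thesis.
    sessions = [
        ["home", "hotels", "flights"],
        ["home", "hotels"],
        ["home", "flights"],
        ["hotels", "flights", "home"],
    ]
    print(frequent_itemsets(sessions, k=2, min_support=3))
```

On a real cluster each map call would run on a separate input split and the grouping step would be handled by the framework, which is why the numbers of Map and Reduce tasks appear among the performance factors listed above.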

Parallel Keywords

Big data; Data mining; Association rules analysis; Sequential pattern analysis; Hadoop; MapReduce; HBase; HDFS; Parallel

