
A Performance Analysis Study of the Big Data Analytics Platform Spark

A Study of Spark Effectiveness Analysis

Advisors: 劉立民, 吳建華

Abstract


We have entered an era of explosive growth in information, and big data storage and analysis have become indispensable. As frameworks for distributed data warehousing and distributed computing continue to emerge, this study analyzes performance using Spark, an in-memory computing framework. Roughly 35GB of raw data were taken from the R.O.C. National Health Insurance database (1999-2008) and expanded to about 105GB of analysis data by random sampling. The data were processed with Decision Tree Regression, and performance was compared across three configurations: (1) a single machine, (2) a five-node distributed cluster, and (3) a ten-node distributed cluster.

We further discuss whether, for the same method, adding nodes can offset the growth in computation time as the data volume grows, and how many distributed computing nodes should be recommended to keep large analytical workloads within a tolerable computation time. This study thus serves as a basis for validating big data frameworks and for deciding when to scale out a cluster.
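The abstract describes growing the 35GB source data to roughly 105GB by random sampling. A minimal sketch of that idea, using a small hypothetical record list in place of the real NHI data and sampling existing rows with replacement via the standard library:

```python
import random

def expand_by_sampling(records, target_size, seed=0):
    """Expand a dataset to target_size rows by appending rows
    sampled uniformly (with replacement) from the originals."""
    rng = random.Random(seed)
    expanded = list(records)
    expanded.extend(rng.choices(records, k=target_size - len(records)))
    return expanded

# Hypothetical stand-in for NHI claim records (the real file is ~35GB).
base = [{"visit_id": i, "cost": 100 + i} for i in range(1000)]

# Grow the data roughly 3x, mirroring the 35GB -> 105GB expansion.
expanded = expand_by_sampling(base, target_size=3000)
print(len(expanded))  # 3000
```

Sampling with replacement keeps the statistical profile of the original records while letting the data volume be scaled to any target size, which is what makes it convenient for load testing.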

Keywords

Big Data

Parallel Abstract (English)


We are now in an era of information explosion, and big data storage and analysis must be handled. MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on commodity clusters, and Spark can outperform Hadoop by 10x on iterative machine learning jobs. This paper focuses on testing Spark's effectiveness using the Ambulatory Care Expenditures by Visits (CD) file, about 35GB in size, from the R.O.C. National Health Insurance database, covering AD 1999 to 2008. For this test, we constructed a distributed computing framework with Hadoop and Spark and measured the computation times of clusters with different numbers of nodes running the Decision Tree Regression machine learning algorithm. The algorithm was run on data sets of different sizes produced by random sampling from the CD file of the NHI database. Finally, we identify an appropriate number of nodes for a distributed system managing this data, and show that computation time can be shortened by adding nodes to the cluster as the data scales.
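The study's core comparison is computation time on one machine versus five- and ten-node clusters. A small sketch of how speedup and parallel efficiency would be derived from such measurements; the runtimes here are hypothetical placeholders, not the thesis's results:

```python
def speedup_and_efficiency(runtimes):
    """Given {node_count: runtime_seconds}, compute speedup and
    parallel efficiency relative to the single-node baseline."""
    base = runtimes[1]
    return {
        n: {"speedup": base / t, "efficiency": base / (t * n)}
        for n, t in runtimes.items()
    }

# Hypothetical runtimes (seconds) for the three configurations tested.
results = speedup_and_efficiency({1: 3000.0, 5: 700.0, 10: 400.0})
for n, m in sorted(results.items()):
    print(f"{n:2d} node(s): speedup {m['speedup']:.2f}x, "
          f"efficiency {m['efficiency']:.2f}")
```

Efficiency below 1.0 reflects coordination and shuffle overhead, which is exactly why the thesis asks how many nodes are worth adding before returns diminish.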

Parallel Keywords (English)

Spark

