The Spark distributed system architecture is deployed through virtualisation. In addition to enabling rapid deployment, this approach makes effective use of hardware capacity, allows flexible allocation of hardware resources, and reduces hardware costs. This study used VMware virtualisation technology to deploy a Spark distributed system, with the Hadoop Distributed File System (HDFS) used for data storage and access. Data analysis performance was evaluated using the in-memory computing framework of Spark resilient distributed datasets (RDDs). Two methods, secondary sorting and WordCount combined with Top-K, were employed to test performance on a data volume of 300 GB and to cross-validate each other. The CPU count, memory size, and number of computing nodes were adjusted at each experimental phase to determine the optimal hardware configuration. The experimental results verified that adding nodes speeds up data analysis in a Spark distributed system. However, when processing a small data volume such as 30 GB, and given sufficient hardware resources on each node, data analysis performance plateaued at a certain threshold and could not be improved further.
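To make the two benchmark workloads concrete, the following is a minimal local Python sketch of what each job computes; it is not the study's actual Spark code. The function names and sample data are hypothetical. In Spark, WordCount combined with Top-K would typically be expressed as `flatMap` → `map` → `reduceByKey` → `takeOrdered(k)`, and secondary sorting as a sort on a composite `(key, value)` key; the plain-Python versions below mirror only the computed result.

```python
from collections import Counter

def wordcount_topk(lines, k):
    """Count word frequencies across all lines and return the k most
    frequent (word, count) pairs, most frequent first.
    Local stand-in for the Spark RDD pipeline:
    flatMap(split) -> map((word, 1)) -> reduceByKey(+) -> takeOrdered(k)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts.most_common(k)

def secondary_sort(records):
    """Sort (key, value) pairs first by key, then by value within each key.
    Mirrors a Spark secondary sort using a composite key followed by
    sortByKey()."""
    return sorted(records, key=lambda kv: (kv[0], kv[1]))
```

For example, `wordcount_topk(["a b a", "b a c"], 2)` yields `[("a", 3), ("b", 2)]`, and `secondary_sort([(2, 5), (1, 9), (2, 1)])` yields `[(1, 9), (2, 1), (2, 5)]`. On a real cluster, the Spark versions distribute these steps across nodes, which is why adding nodes speeds up the 300 GB runs until per-node resources stop being the bottleneck.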