Improving the Performance of Iterative Big Data Computation: A Case Study of Spark Programs

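The arrival of the big data era has prompted researchers to develop a succession of tools for processing large datasets. Spark has been the most popular big data analysis tool in recent years, and its data processing performance exceeds that of Hadoop. Because it computes in memory, Spark is well suited to iterative applications and interactive data mining. Nevertheless, Spark still has room for improvement on iterative applications: when Spark executes an operation that involves a Shuffle, the cross-node data transfer that the Shuffle requires increases computation time. If the number of Shuffles can be reduced, Spark's performance will improve. This study modifies parts of the program to improve Spark's performance on iterative applications and identifies other operations to which the same modification can be applied. The modifications are implemented in three experimental cases, each run with different input datasets and iteration counts, to determine the conditions under which the program modification is most suitable. The experimental results show that, when multiple RDD operations involving Shuffle are used, replacing the Shuffle operations with a strategy of downloading a smaller RDD can reduce execution time by more than 30% in the best case.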
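The idea of replacing a Shuffle with a downloaded copy of the smaller dataset can be illustrated with a small model. The sketch below is plain Python, not Spark code, and it is not the program used in this study. It assumes that "downloading a smaller RDD" means collecting the smaller dataset and joining against it locally on every partition. The partition layout, the function names `shuffle_join` and `broadcast_join`, and the record counts are all hypothetical, chosen only to compare how many records cross partition boundaries under each strategy.

```python
from collections import defaultdict


def shuffle_join(left_parts, right_parts, num_parts):
    """Model of a shuffle join: both inputs are re-routed by key hash.

    Returns the sorted join result and the number of records that had to
    leave the partition they started in.
    """
    moved = 0
    left_buckets = [defaultdict(list) for _ in range(num_parts)]
    right_buckets = [defaultdict(list) for _ in range(num_parts)]
    for parts, buckets in ((left_parts, left_buckets), (right_parts, right_buckets)):
        for src, part in enumerate(parts):
            for key, value in part:
                dst = hash(key) % num_parts
                if dst != src:
                    moved += 1  # record crosses a partition boundary
                buckets[dst][key].append(value)
    out = []
    for lb, rb in zip(left_buckets, right_buckets):
        for key, left_values in lb.items():
            for lv in left_values:
                for rv in rb.get(key, []):
                    out.append((key, (lv, rv)))
    return sorted(out), moved


def broadcast_join(left_parts, right_parts):
    """Model of the 'download the smaller dataset' strategy.

    The small side is gathered into one lookup table and copied to every
    partition of the large side; the large side never moves.
    """
    lookup = defaultdict(list)
    small_size = 0
    for part in right_parts:
        for key, value in part:
            lookup[key].append(value)
            small_size += 1
    moved = small_size * len(left_parts)  # one copy of the small side per partition
    out = []
    for part in left_parts:
        for key, lv in part:
            for rv in lookup.get(key, []):
                out.append((key, (lv, rv)))
    return sorted(out), moved


# Hypothetical data: 1000 records on the large side, 10 on the small side,
# both spread round-robin over 4 partitions. Integer keys keep hash() stable.
NUM_PARTS = 4
large_parts = [[(i % 10, i) for i in range(p, 1000, NUM_PARTS)] for p in range(NUM_PARTS)]
small_parts = [[(k, k * k) for k in range(p, 10, NUM_PARTS)] for p in range(NUM_PARTS)]

joined_s, moved_s = shuffle_join(large_parts, small_parts, NUM_PARTS)
joined_b, moved_b = broadcast_join(large_parts, small_parts)

print("same result:", joined_s == joined_b)
print("records moved (shuffle vs. broadcast):", moved_s, moved_b)
```

Both functions produce the same joined records, but the second one moves only copies of the small side. In an iterative job the saving would repeat on every iteration, which is consistent with the motivation given above. The approach only makes sense when the smaller dataset fits comfortably in memory on each node.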

Keywords

Big Data; Spark; Iterative applications

Parallel Abstract

The big data era has produced many big data analysis tools. Spark, whose in-memory processing suits iterative applications and interactive data mining, is the most popular of these tools, and its data processing performance is better than that of Hadoop. However, Spark has some disadvantages: large datasets cause cross-node data transfer, which increases computation time. If Spark executes fewer Shuffle operations, its performance improves. This study modifies the program to enhance the performance of iterative applications. It uses three experimental cases with diverse datasets and iteration counts to find the conditions under which the modified program code is most suitable. We found that, when several RDD operations involving Shuffle are used, a strategy of downloading a smaller RDD can replace the Shuffle operations. The experimental results show that execution time improves by up to 30%.

Parallel Keywords

Big Data; Spark; Iterative application
