
A Study on Optimizing Apache Spark Cluster Task Scheduling with Deep Learning Techniques

Advisor: 陳世穎
Co-advisor: 陳弘明 (Hung-Ming Chen)

Abstract


With the rise of big data in recent years, a growing number of frameworks have emerged to meet the demand, and Apache Spark is among the best known. Spark was originally developed by AMPLab at the University of California, Berkeley, and the project was donated to the Apache Software Foundation in 2013. It is an in-memory distributed data-processing framework. As hardware is upgraded over time, however, machine clusters become heterogeneous, and dispatching tasks to the individual nodes becomes a problem in its own right.

To address this heterogeneity, Wu Bing-Yu proposed the CMM (Cluster Min-Max) architecture, which groups nodes of similar computing capability and lets a single scheduler dispatch tasks to each group, improving Spark's runtime performance. CMM also collects data such as Spark run parameters, execution times, and input data sizes, applies regression analysis to predict each group's execution time, and schedules against those predictions to accelerate overall computation and achieve efficient task scheduling. To realize such customized clusters, however, the CMM work had to spend considerable time installing software, managing cluster state, maintaining machine services, and configuring newly added nodes, which made dynamic scaling and dynamic configuration hard to achieve.

To address these problems, this thesis proposes the KCMM architecture, which deploys the environment with Kubernetes and Docker to solve environment installation, dynamic scaling, node-failure handling, and dynamic resource provisioning. In addition, new combinations of Spark run parameters are derived for the CMM model, and the prediction model is replaced with an optimized deep learning model, improving the accuracy of overall runtime prediction and speeding up overall task scheduling, so as to obtain a cluster environment that is easy to deploy and delivers better computing performance.
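The abstract's core mechanism, predicting each group's execution time from Spark run parameters and then dispatching jobs so that no group's queue grows disproportionately, can be illustrated with a short sketch. The Python code below is an illustrative assumption, not the thesis's implementation: the feature set (input size, executor count, cores, memory), the JobSpec type, and the use of scikit-learn's MLPRegressor as a stand-in for the deep learning model are all hypothetical choices made for the example.

# Hypothetical sketch of the CMM/KCMM idea: learn a per-group runtime
# predictor, then greedily place each job on the group whose predicted
# finish time stays lowest (a Min-Max list-scheduling heuristic).
# Names and features here are illustrative, not the thesis's actual code.
from dataclasses import dataclass
from typing import Dict, List

import numpy as np
from sklearn.neural_network import MLPRegressor  # stand-in for the deep model


@dataclass
class JobSpec:
    input_size_mb: float     # size of the job's input data
    executors: int           # Spark executor count requested
    executor_cores: int      # cores per executor
    executor_memory_gb: int  # memory per executor


def train_runtime_model(features: np.ndarray, runtimes: np.ndarray) -> MLPRegressor:
    """Fit a small feed-forward regressor mapping job features to runtime (s)."""
    model = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=0)
    model.fit(features, runtimes)
    return model


def min_max_assign(jobs: List[JobSpec],
                   models: Dict[str, MLPRegressor]) -> Dict[str, List[JobSpec]]:
    """Assign each job to the node group with the smallest predicted finish time.

    `models` holds one runtime predictor per node group, where groups were
    formed from nodes of similar measured computing capability.
    """
    load = {g: 0.0 for g in models}               # predicted busy time per group
    plan: Dict[str, List[JobSpec]] = {g: [] for g in models}
    for job in jobs:
        x = np.array([[job.input_size_mb, job.executors,
                       job.executor_cores, job.executor_memory_gb]])
        # Finish time if this job were appended to each group's queue.
        finish = {g: load[g] + float(m.predict(x)[0]) for g, m in models.items()}
        best = min(finish, key=finish.get)        # keeps the maximum load low
        plan[best].append(job)
        load[best] = finish[best]
    return plan


if __name__ == "__main__":
    # Synthetic training data, purely for demonstration.
    rng = np.random.default_rng(0)
    X = rng.uniform(1, 100, size=(200, 4))
    y = X @ np.array([0.5, 2.0, 1.0, 0.3]) + rng.normal(0, 1, 200)
    fast = train_runtime_model(X, y)
    slow = train_runtime_model(X, y * 1.8)        # a slower hardware group
    jobs = [JobSpec(50.0, 4, 2, 8), JobSpec(200.0, 8, 4, 16)]
    print(min_max_assign(jobs, {"group-a": fast, "group-b": slow}))

In this greedy variant, sending each job to the group with the smallest predicted finish time heuristically minimizes the maximum group load (the makespan), which is consistent with the Min-Max scheduling idea the abstract describes; the thesis's deep learning predictor would simply take the place of the regressor shown here.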


