Optimizing the Performance of Hadoop-MapReduce by Implementing a Data-Locality Scheduling Algorithm

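Cloud computing is increasingly popular and continues to develop in architecture, networking, and software. Hadoop-MapReduce, a widely used software framework, processes large volumes of data in parallel on a distributed cluster. Because its processing nodes can be scaled out to a very large number, Hadoop-MapReduce offers a processing platform with substantial computing power. Network traffic has long been the biggest bottleneck in data-intensive computing and significantly affects the performance of data-parallel systems. This bottleneck stems from limited network bandwidth, which makes network transfer much slower than disk access. Good data locality, however, reduces network traffic and improves the performance of data-intensive HPC (high-performance computing) systems. Hadoop's scheduler has the drawback of not taking data locality into account when allocating resources, so this thesis proposes a locality-aware scheduling algorithm for Hadoop-MapReduce. First, we propose a mathematical model of the weight of data interference for Hadoop scheduling. Second, we combine a data-locality scheduling algorithm with the weight of data interference to provide locality-aware resource allocation. Finally, we set up three physical machines running Xen Cloud Platform, each hosting two virtual machines with Hadoop installed, and use simulation to verify the performance of the algorithm.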

Keywords

Cloud computing; data locality

Parallel Abstract

Cloud computing has become increasingly popular and continues to develop in architecture, software, and networking. Hadoop-MapReduce is a common software framework for processing parallelizable problems over large datasets on a distributed cluster. Hadoop-MapReduce can scale incrementally in the number of processing nodes, so it provides a processing platform with powerful computation. Network traffic has always been a major bottleneck in data-intensive computing, and network latency significantly degrades performance in data-parallel systems. This bottleneck is caused by limited network bandwidth: network transfer is much slower than disk data access. Good data locality therefore reduces network traffic and increases performance in data-intensive HPC (high-performance computing) systems. However, Hadoop's scheduler does not adequately consider data locality in resource assignment. This paper presents a locality-aware scheduling algorithm for the Hadoop-MapReduce scheduler. First, we propose a mathematical model of the weight of data interference in the Hadoop scheduler. Second, we present an algorithm that uses the weight of data interference to provide data-locality-aware resource assignment in the Hadoop scheduler. Finally, we build an experimental environment of three physical machines running Xen Cloud Platform, each hosting two virtual machines with Hadoop installed, and run simulations to verify the performance of the locality-aware scheduling algorithm.
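The general idea of locality-aware task assignment can be illustrated with a minimal Python sketch. It is not the algorithm or the weight-of-data-interference model proposed in this work, whose definition the abstract does not give. It assumes a simple three-level cost (data on the same virtual machine, on the same physical host, or remote), and the names `HOST_OF`, `locality_cost`, and `pick_task` are hypothetical. The layout only mirrors the described testbed of three physical machines with two virtual machines each.

```python
from typing import Dict, List, Optional, Set

# Hypothetical layout mirroring the described testbed: 3 hosts, 2 VMs each.
HOST_OF: Dict[str, str] = {
    "vm1": "hostA", "vm2": "hostA",
    "vm3": "hostB", "vm4": "hostB",
    "vm5": "hostC", "vm6": "hostC",
}


def locality_cost(vm: str, replica_vms: Set[str]) -> int:
    """Assumed cost: 0 = data on this VM, 1 = same physical host, 2 = remote."""
    if vm in replica_vms:
        return 0
    if any(HOST_OF[r] == HOST_OF[vm] for r in replica_vms):
        return 1
    return 2


def pick_task(free_vm: str,
              pending: List[str],
              replicas: Dict[str, Set[str]]) -> Optional[str]:
    """Return the pending task whose input block is closest to free_vm.

    Ties are broken by queue order, because min() keeps the first minimum.
    """
    if not pending:
        return None
    return min(pending, key=lambda task: locality_cost(free_vm, replicas[task]))


if __name__ == "__main__":
    # Task name -> set of VMs that hold a replica of its input block.
    replicas = {"t1": {"vm3"}, "t2": {"vm2"}, "t3": {"vm1"}}
    print(pick_task("vm1", ["t1", "t2", "t3"], replicas))
```

When a slot frees up on `vm1`, the sketch skips the two tasks ahead in the queue and picks `t3`, the only task whose input block is stored on that virtual machine, so no block has to cross the network.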

Parallel Keywords

Hadoop; MapReduce

