Load Balancing and Data Placement in MapReduce

在大數據的時代，每天無所不在且無時無刻的產生大量的結構化和非結構化數據。處理大數據最困難的工作是需要同時在許多的電腦中平行執行軟體。 MapReduce是一個新的程式架構，對於處理大數據時他可簡易的將程式分散在各電腦中執行。 MapReduce 是利用電腦的網路連接的性質，將工作在電腦之間做平行處理。雲端計算往往是與MapReduce交互使用的概念。雲端計算是一種計算資源在網上互聯的計算方法。對於處理大數據而言，雲端計算與MapReduce架構的結合是一個完美的組合，MapReduce架構提供了一個既靈活且有彈性的硬體平台和結構，而雲端計算提供了MapReduce所需的密集型數據計算的資源。本文著眼於在群集或數據中心的資源利用率，並專注於負載平衡和數據在MapReduce放置。負載平衡是一種跨多台計算機或群集電腦的分散工作量的方法，並涉及到中央處理單元、硬碟和其他資源。負載平衡的目的是為了實現最佳的資源利用率，最大限度地提高處理能力，減少反應時間，並避免過載。在工作的期間，數據放置的位置與負載平衡是密切相關的。在這篇論文中，我們研究的負載平衡和MapReduce的數據佈局，以及它是如何在不同的情況下使用。然而，我們提出幾種演算法來提升MapReduce的負載均衡。首先，我們介紹一種能改善對全局排序(Total Order Partitioner)的抽樣方法。其次，我們提出一個使用動態資料重新配置(dynamic data redistribution)的多表格連結的演算法(multiway join algorithm)。最後我們展現了如何讓數據可以更智能地在異構雲端環境中的分佈。在我們的第一個主題，我們提出對目前的XTrie，ETrie和ATrie演算法對資料去做改進的抽樣方法和配置方法。該XTrie演算法使用固定的記憶體使用量，這不像是傳統的全局排序(total order partitioner)法把每個樣本都儲存在記憶體中。除此之外，我們的XTrie具有更好的性能，且XTrie對200,000樣本能有快七倍的執行時間。使用XTrie也改善ETrie和ATrie演算法對記憶體的需求，對ETrie而言用，XTrie可減少記憶體達到1/16。最後藉由使用在出生日期數據的自適應方法(adaptive method) 和一個2級的線索，可讓使用XTrie的ATrie減少記憶體到達1/16384。在我們的第二個主題中，我們提出了一個MapReduce的網絡感知多表格連結（SmartJoin）提高性能，並在重新分配工作負載於reducer時考慮網絡流量。我們發現這可以減少連結多個數據組所需的時間。在我們的評測中，證明SmartJoin的效能比未重新分配的方法(non-redistribution method)好39%，比隨機重新分配方法(random redistribution)好26.8%，比worst join重新分配方法好27.6%。最後，在我們的第三個主題中，我們對MapReduce佈署在異質雲環境提出了一個動態資料分配方法和虛擬機映射器。模擬和實驗結果證明，能提高MapReduce的性能，對數據局部性的部分改善了33％且提升最佳總完工時間41％。此外，藉由使用負載感知虛擬機映射器(virtual machine mapper)可在reduce階段額外減少13%的完成時間。

關鍵字

MapReduce ；大數據；雲端計算；分散式計算；異質性；集群；執行時間最佳化；加快數據密集型計算；數據中心；資源管理

並列摘要

In the era of Big Data, huge amounts of structured and unstructured data are being produced daily by a myriad of ubiquitous sources. Big Data is difficult to work with and requires massively parallel software running on a large number of computers. MapReduce is a recent programming model that simplifies writing distributed applications that handle Big Data. MapReduce does this by dividing its workload amongst computers in a network and then processing the work in parallel. Often intertwined with the concept of MapReduce is Cloud Computing. Cloud Computing is a method of computing that shares computing resources over the Internet. Cloud Computing is a perfect match for handling Big Data and the MapReduce model by providing a hardware platform and framework that is flexible and elastic. Cloud Computing provides the resources needed for the data intensive computation required by MapReduce and its data. This dissertation looks at resource utilization within a cluster or data center and focuses on load balancing and data placement in MapReduce. Load balancing is a method to distribute workload across multiple computers or a computer cluster and involves central processing units, disk drives, and other resources. The purpose of load balancing is to achieve optimal resource utilization, maximize throughput, minimize response time, and avoid overload. Data placement is closely related to load balancing and refers to where data is in the network during the lifetime of a job. In this dissertation, we investigate load balancing and data placement for MapReduce and how it applies in different situations. We then propose several algorithms that can improve load balancing for MapReduce. Firstly we introduce an improved sampling method for total order partitioning. Secondly we present a multiway join algorithm that uses dynamic data redistribution and finally we show how data can be more intelligently distributed within a heterogeneous cloud environment. In our first topic, we propose an improved sampling and partitioning method for strings. In this topic, we present the XTrie, ETrie and ATrie algorithms. The XTrie algorithm uses a fixed memory footprint, which is unlike the traditional total order partitioning method that stores all elements in a sample set in memory. Furthermore, we show XTrie has better performance, and is able to execute 7 times faster on 200,000 samples. Both ETrie and ATrie algorithms further improve the memory requirements used by XTrie. ETrie was able to reduce memory consumption to 1/16 of that used by XTrie. Finally, ATrie was able to reduce memory consumption by 1/16384 of that used by XTrie, by using an adaptive method on birthdate data and a 2-level trie. In our second topic, we propose a network aware multiway join for MapReduce (SmartJoin) that improves performance and considers network traffic when redistributing workload amongst reducers. We show this can reduce the time required to join multiple datasets. In our evaluation, we show that SmartJoin has up to 39% improvement compared to the non-redistribution method, a 26.8% improvement over random redistribution and 27.6% improvement over WorstJoin redistribution. Finally, in our third topic we propose a dynamic data partitioner and virtual machine mapper for MapReduce when deployed on a heterogeneous cloud environment. Simulation and experimental results show an improvement in MapReduce performance, improving data locality by 33% and optimizing total completion time by 41%. Furthermore, by using the Load Aware Virtual Machine Mapper obtained an additional 13% improvement in reduce phase completion time.

並列關鍵字

MapReduce ； BigData ； Cloud Computing ； Distributed Computing ； Heterogeneity ； Cluster ； Runtime Optimization ； Accelerating Data Intensive Computation ； Data Center ； Resource Management

參考文獻

[2] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel data processing with MapReduce: a survey," ACM SIGMOD Record, vol. 40, pp. 11-20, 2012.

[3] J. Dittrich and J.-A. Quiané-Ruiz, "Efficient big data processing in Hadoop MapReduce," Proceedings of the VLDB Endowment, vol. 5, pp. 2014-2015, 2012.

[4] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in ACM SIGOPS Operating Systems Review, 2003, pp. 29-43.

[5] T. White, Hadoop: The definitive guide: O'Reilly Media, Inc., 2012.

[7] Hadoop Available: http://hadoop.apache.org/, Accessed:(Oct. 1, 2013)

國際替代計量

Load Balancing and Data Placement in MapReduce

主題瀏覽