A Study of Join Operations Based on the MJR Model in Cloud Computing

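Large-scale data analysis is an important problem in cloud computing, especially the table join operation (Join) over large-scale data. Join operations fall into two kinds: equi-joins (Equi-Join) and non-equi-joins (NonEqui-Join). In cloud environments that run complex data analysis tasks, the MapReduce framework provides a new computing model that distributes an analysis task across different machines for parallel execution and then merges the results. To improve MapReduce's performance on large-scale data analysis, researchers proposed the Map-Join-Reduce application programming interface (API), which improves performance for equi-joins but does not support NonEqui-Join. The method proposed in this thesis builds on the Map-Join-Reduce framework: besides improving MapReduce's performance on equi-joins, it also supports non-equi-join tasks. The method has three main steps. The first filters the data: according to the query, rows that match a specified range condition or a specified single-value condition are selected (Filter). The second performs the join: to speed it up, the join is distributed across several Workers and executed in parallel. The last collects and integrates the data: the joined rows from each Worker are collected and aggregated (Aggregation), evaluating the expressions that the query's select clause specifies for the output. Finally, for complex join operations, the method uses a BloomFilter to improve performance.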
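The three steps above (filter, parallel join, aggregate) can be illustrated with a minimal single-machine sketch. This is not the thesis implementation: the tables, column names, and helper functions (`filter_rows`, `partition`, `theta_join`, `aggregate_count`, `run_query`) are hypothetical, threads stand in for cluster Workers, and the smaller table is simply copied to every worker, which is an assumption rather than the partitioning scheme the thesis uses.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import operator

# Hypothetical toy tables; names and columns are illustrative only.
ORDERS = [
    {"order_id": 1, "amount": 120},
    {"order_id": 2, "amount": 40},
    {"order_id": 3, "amount": 300},
]
TIERS = [
    {"tier": "bronze", "min_amount": 50},
    {"tier": "silver", "min_amount": 100},
    {"tier": "gold", "min_amount": 250},
]


def filter_rows(rows, predicate):
    """Step 1: keep only the rows that satisfy the query's selection condition."""
    return [row for row in rows if predicate(row)]


def partition(rows, num_workers):
    """Split one table into round-robin chunks, one chunk per worker."""
    return [rows[i::num_workers] for i in range(num_workers)]


def theta_join(left_chunk, right_rows, left_key, right_key, theta):
    """Step 2: nested-loop join on an arbitrary comparison operator (theta)."""
    return [
        {**left, **right}
        for left in left_chunk
        for right in right_rows
        if theta(left[left_key], right[right_key])
    ]


def aggregate_count(joined_rows, group_key):
    """Step 3: COUNT(*) grouped by one column over the collected rows."""
    counts = defaultdict(int)
    for row in joined_rows:
        counts[row[group_key]] += 1
    return dict(counts)


def run_query(num_workers=2):
    # Step 1: range filter, amount >= 50.
    orders = filter_rows(ORDERS, lambda row: row["amount"] >= 50)

    # Step 2: each worker joins its chunk with the full TIERS table
    # using the non-equi condition amount >= min_amount.
    chunks = partition(orders, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(
            pool.map(
                lambda chunk: theta_join(
                    chunk, TIERS, "amount", "min_amount", operator.ge
                ),
                chunks,
            )
        )

    # Step 3: collect the partial results and aggregate them.
    joined = [row for part in partials for row in part]
    return aggregate_count(joined, "tier")
```

Because the join condition is an inequality, a hash partition on the join key would not bring matching rows together; the sketch sidesteps this by giving every worker the whole smaller table, which is only reasonable when that table is small.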

Keywords

Cloud computing; MapReduce; MJR API; join operation; large-scale data

Parallel Abstract

Large-scale data analysis is an important topic in cloud computing. Such analysis usually requires complex processing, such as theta-join query processing. Theta-joins comprise two kinds of operation, equi-joins and non-equi-joins. MapReduce is an important programming framework for parallel data processing in the cloud. To improve the performance of MapReduce on complex data analysis, researchers proposed the Map-Join-Reduce API, which supports equi-join operations. However, the Map-Join-Reduce API cannot support non-equi-join operations. In this thesis, we propose a method that extends the Map-Join-Reduce framework to support non-equi-join operations. The proposed method has three main steps. First, data are filtered according to the query statement. Second, the filtered data are sent to their corresponding workers according to the join expression, for a higher level of parallelism; each worker then performs its join operation on the filtered data it receives. Third, we aggregate the results using the aggregate functions specified in the select clause. Finally, we adopt a Bloom filter (BloomFilter) to improve performance for complex join operations.
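The abstract does not say how the Bloom filter is applied, so the following is only a sketch of one common use in join processing: build a filter over the join keys of one table, then discard rows of the other table whose key cannot match before they are sent to the workers. The class `BloomFilter`, its parameters `num_bits` and `num_hashes`, and the helper `prefilter` are all hypothetical names chosen for illustration.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, for readability rather than compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        # Derive num_hashes positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(key))


def prefilter(rows, key, bloom):
    """Drop rows whose join key is definitely absent from the other table."""
    return [row for row in rows if bloom.might_contain(row[key])]


# Usage: only orders whose cust_id may appear in customers survive.
customers = [{"cust_id": 1}, {"cust_id": 7}]
orders = [{"order_id": i, "cust_id": i % 10} for i in range(100)]

bloom = BloomFilter()
for customer in customers:
    bloom.add(customer["cust_id"])

survivors = prefilter(orders, "cust_id", bloom)
```

A Bloom filter tests set membership, so this pre-filtering applies to equality conditions within a multi-way join. Rows that pass may still be false positives and must be checked by the actual join.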

Parallel Keywords

Cloud computing; MapReduce; MJR API; join operation; large-scale data

