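Large-scale data analysis is an important problem in cloud computing, particularly when it involves join operations over large-scale data. Join operations are of two kinds: equi-join and nonequi-join. In cloud environments that run complex data analysis tasks, the MapReduce framework offers a new computing model that distributes an analysis task across machines for parallel execution and then merges the results. To improve the performance of MapReduce on large-scale data analysis, researchers proposed the Map-Join-Reduce application programming interface, which improves the performance of equi-join tasks but does not support nonequi-join. The method proposed in this thesis builds on the Map-Join-Reduce framework: in addition to improving the performance of MapReduce on equi-joins, it supports nonequi-join tasks. The method has three main steps. The first step filters the data according to the query, selecting the rows that satisfy a range condition or a single specified condition. The second step performs the join; to speed it up, the join is distributed across several workers that run in parallel. The final step collects and integrates the data: the join output of each worker is gathered and aggregated, and the expressions in the select clause of the query are evaluated. Finally, for complex join operations, the method uses BloomFilter to improve performance.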
Large-scale data analysis is an important topic in cloud computing. Such analysis usually requires complex processing, such as theta-join query processing. Theta-join comprises two operations, equi-join and nonequi-join. MapReduce is an important programming framework for parallel data computing in the cloud. To improve the performance of MapReduce on complex data analysis, researchers proposed the Map-Join-Reduce API, which supports the equi-join operation. However, the Map-Join-Reduce API cannot support nonequi-join operations. In this thesis, we propose a method that extends the Map-Join-Reduce framework to support nonequi-join operations. The proposed method has three main steps. First, data are filtered according to the query statement. Second, the filtered data are sent to their corresponding workers according to the join expression, which increases the level of parallelism; each worker performs its join operation after receiving the filtered data. Third, the results are aggregated using the aggregate functions specified in the select clause. Finally, we adopt BloomFilter to improve performance for complex join operations.
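The three steps above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the implementation described in the thesis: the tables, column names, and helper functions (filter_rows, partition, theta_join, aggregate, run_query) are all hypothetical, threads stand in for cluster workers, the smaller table is simply copied to every worker, and the BloomFilter optimization is omitted.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import operator

# Hypothetical toy tables; the column names are illustrative only.
orders = [
    {"order_id": 1, "region": "north", "amount": 120},
    {"order_id": 2, "region": "south", "amount": 40},
    {"order_id": 3, "region": "north", "amount": 75},
    {"order_id": 4, "region": "south", "amount": 300},
]
tiers = [
    {"tier": "bronze", "min_amount": 50},
    {"tier": "silver", "min_amount": 100},
    {"tier": "gold", "min_amount": 250},
]


def filter_rows(rows, predicate):
    """Step 1: keep only the rows that satisfy the query's selection condition."""
    return [row for row in rows if predicate(row)]


def partition(rows, num_workers):
    """Split rows round-robin so that each worker receives a disjoint slice."""
    return [rows[i::num_workers] for i in range(num_workers)]


def theta_join(left_rows, right_rows, left_key, right_key, compare):
    """Step 2 (per worker): nested-loop join under an arbitrary comparison."""
    return [
        {**left, **right}
        for left in left_rows
        for right in right_rows
        if compare(left[left_key], right[right_key])
    ]


def aggregate(rows, group_key, value_key):
    """Step 3: sum value_key per group_key over the collected join output."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def run_query(num_workers=2):
    # WHERE orders.amount >= 50 (a range condition)
    filtered = filter_rows(orders, lambda row: row["amount"] >= 50)

    # JOIN tiers ON orders.amount >= tiers.min_amount (a nonequi-join).
    # Assumption: the small table is sent whole to every worker.
    slices = partition(filtered, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(
            lambda part: theta_join(
                part, tiers, "amount", "min_amount", operator.ge
            ),
            slices,
        ))

    # Collect the per-worker output, then SELECT tier, SUM(amount) GROUP BY tier.
    joined = [row for part in partials for row in part]
    return aggregate(joined, "tier", "amount")


if __name__ == "__main__":
    print(run_query())
```

Because the join condition is an inequality, rows cannot be routed to workers by hashing the join key as in an equi-join; the sketch sidesteps this by partitioning one table and replicating the other, which is only one of several possible placement strategies.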