Algorithms for Evaluating Double Semi-Join Queries on Distributed Databases

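Queries on distributed database systems often involve large amounts of data, so the performance of query evaluation algorithms is important. Spark SQL is a module on Apache Spark that provides an engine for efficient query evaluation. In this thesis, we propose algorithms for evaluating the following queries on a distributed database system: the union of two semi joins, the intersection of two semi joins, the difference of two semi joins, and the intersection of two anti joins. We implement the algorithms on Spark and obtain some improvement over Spark SQL.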
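The four query shapes named above can be pinned down with a small reference sketch. This is plain single-machine Python that only illustrates what each query returns; it is not the distributed algorithm proposed in the thesis and does not use Spark. It assumes, for illustration, that both joins share the same left relation `R` and join on a single key column. The names `R`, `S_keys`, `T_keys`, `semi_join`, and `anti_join` are hypothetical.

```python
# Reference semantics only: plain Python sets, no Spark, not the thesis's algorithm.
# Assumption: both joins share the left relation R and join on one key column.

def semi_join(left, right_keys, key=0):
    """Rows of `left` whose key value appears in `right_keys`."""
    return {row for row in left if row[key] in right_keys}

def anti_join(left, right_keys, key=0):
    """Rows of `left` whose key value does NOT appear in `right_keys`."""
    return {row for row in left if row[key] not in right_keys}

# Hypothetical toy data: R holds (key, payload) rows; S and T are reduced to key sets.
R = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
S_keys = {1, 2}
T_keys = {2, 3}

# The four combinations listed in the abstract.
union = semi_join(R, S_keys) | semi_join(R, T_keys)
intersection = semi_join(R, S_keys) & semi_join(R, T_keys)
difference = semi_join(R, S_keys) - semi_join(R, T_keys)
anti_intersection = anti_join(R, S_keys) & anti_join(R, T_keys)
```

Under the shared-left-relation assumption, each result is a subset of `R` chosen by whether a row's key occurs in `S`, in `T`, in both, or in neither.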

Keywords

Relational query; distributed database system

Parallel Abstract

Querying information in a distributed database system often involves a large amount of data. Thus, the efficiency of the query evaluation algorithm is very important. Spark SQL is a module in Apache Spark that provides an efficient query evaluation engine for SQL queries on top of Spark. In this thesis, we propose an algorithm for evaluating a combination of two semi join queries in a distributed database system. The combination is the union, intersection, or difference of two semi join queries. We also consider the intersection of two anti join queries. We implement our algorithm in Spark and demonstrate that it offers some improvement over Spark SQL.

Parallel Keywords

Relational query; distributed database system; Spark SQL
