
用於計算分散式資料庫上雙重半加入查詢的演算法

Evaluating Combinations of Two Semi-join Queries in Distributed System

Advisor: 陳偉松

Abstract


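Queries on distributed database systems often involve large amounts of data, so the performance of query evaluation algorithms is important. Spark SQL is a module on Apache Spark that provides an engine for efficient query evaluation. In this thesis, we propose algorithms for evaluating the following queries on a distributed database system: the union of two semi joins, the intersection of two semi joins, the difference of two semi joins, and the intersection of two anti joins. We implement the algorithms on Spark and obtain some improvement over Spark SQL.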

Parallel Abstract


Querying information in a distributed database system often involves a large amount of data. Thus, the efficiency of the query evaluation algorithm is very important. Spark SQL is a module in Apache Spark that provides an efficient query evaluation engine for SQL queries on top of Spark. In this thesis, we propose an algorithm for evaluating a combination of two semi join queries in a distributed database system. The combination is the union, intersection, or difference of two semi join queries. We also consider the intersection of two anti join queries. We implement our algorithm in Spark and demonstrate that it offers some improvement over Spark SQL.
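
To make the four query shapes concrete, here is a minimal single-machine sketch in plain Python. It only illustrates what the queries compute; it is not the thesis's distributed algorithm and does not use Spark. It assumes that both semi joins (or anti joins) share the same left relation `r`, that the join key is the first column of every relation, and that `r` contains no duplicate rows. All function and variable names are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

Row = Tuple[Hashable, ...]


def semi_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Rows of `left` whose key (column 0) appears in `right`."""
    keys = {row[0] for row in right}
    return [row for row in left if row[0] in keys]


def anti_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Rows of `left` whose key (column 0) does not appear in `right`."""
    keys = {row[0] for row in right}
    return [row for row in left if row[0] not in keys]


def combine(r: List[Row], s: List[Row], t: List[Row], op: str) -> List[Row]:
    """Evaluate a combination of two joins on `r` in a single pass over `r`.

    Assumption (not from the thesis): both joins use `r` as the left side,
    so each combination reduces to one membership test per row of `r`.
    """
    s_keys: Set[Hashable] = {row[0] for row in s}
    t_keys: Set[Hashable] = {row[0] for row in t}
    tests: Dict[str, Callable[[Hashable], bool]] = {
        # (r semi s) UNION (r semi t)
        "semi_union": lambda k: k in s_keys or k in t_keys,
        # (r semi s) INTERSECT (r semi t)
        "semi_intersection": lambda k: k in s_keys and k in t_keys,
        # (r semi s) EXCEPT (r semi t)
        "semi_difference": lambda k: k in s_keys and k not in t_keys,
        # (r anti s) INTERSECT (r anti t)
        "anti_intersection": lambda k: k not in s_keys and k not in t_keys,
    }
    keep = tests[op]
    return [row for row in r if keep(row[0])]


r = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
s = [(1,), (2,)]
t = [(2,), (3,)]

print(combine(r, s, t, "semi_union"))
print(combine(r, s, t, "semi_intersection"))
print(combine(r, s, t, "semi_difference"))
print(combine(r, s, t, "anti_intersection"))
```

Under these assumptions, each combination needs only one scan of `r` instead of two separate joins followed by a set operation. This hints at why evaluating the combination as a single query might save work, although the thesis's actual method may differ.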

