Structured Data Processing with MapReduce in NoSQL Databases

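As data volumes have grown rapidly in recent years, data storage and processing have become increasingly important. Both traditional relational database systems and the recently popular NoSQL (non-relational) databases seek scalable solutions for handling this volume of data. Since Google published several papers on the distributed storage and processing systems it uses internally, research in this area has entered a new phase. MapReduce is one of the most important of these techniques: it processes large amounts of data efficiently in parallel and offers attractive properties such as scalability and fault tolerance. Many recent studies use MapReduce to support the SQL or SQL-like query languages of traditional database systems. Most of these studies focus on the Hadoop distributed system. In enterprise settings, however, the data in a database often needs to be updated frequently, so we need a system such as HBase that not only stores data with better scalability but also lets the data be manipulated transparently. In contrast to parallel relational databases, which evolved from relational databases, this thesis proposes a systematic method for processing structured data on a NoSQL database. We migrate data originally stored in a relational database to a NoSQL database and use MapReduce to implement SQL-based processing logic for the data stored there. Finally, experimental results verify that the method can be used in the same way as a conventional system while remaining scalable and processing large amounts of data efficiently.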

Keywords

Cloud computing ； Structured data processing ； MapReduce ； NoSQL database

Parallel Abstract

With the rapid growth of data in recent years, data storage and processing have attracted increasing attention as a means of extracting important information. Finding a scalable solution for processing large-scale data is a critical issue for both relational database systems and the emerging NoSQL databases. Since Google published several techniques that it operates successfully in-house, the literature on distributed data storage and processing has been strongly influenced, and a new paradigm, so-called cloud computing, has emerged. MapReduce is one of the critical techniques for processing massive data in parallel. With its inherent scalability and fault tolerance, MapReduce is attractive for large-scale data processing. Several studies have used MapReduce to support SQL or SQL-like queries. Most of this previous work focuses on the Hadoop distributed file system. From the viewpoint of some enterprises, however, the data residing in a database may change frequently as updates occur. Accordingly, we need a flexible data store such as Bigtable or HBase, not only to place the data on a scale-out storage system but also to manipulate the changing data in a transparent way. In this thesis, we propose a systematic method that uses MapReduce for structured data processing in a NoSQL database. We use HBase as the underlying NoSQL database, analyze the major data manipulation statements of ANSI SQL, and provide corresponding queries to manipulate the data residing in the NoSQL database. To organize the data with less complexity, we also introduce a remapping strategy that translates the data model from the relational database to the NoSQL database. Experimental results show that our approach outperforms the conventional approach in terms of efficiency and scalability on large-scale data sets.
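
As a rough illustration of the two ideas summarized above (remapping relational rows into a row-key/column layout, then answering a SQL aggregate with map and reduce steps), the following Python sketch simulates both in memory. It is not the thesis's implementation: the `orders` table, the column family name `cf`, and every function name are hypothetical, and a plain dictionary stands in for an HBase table.

```python
from collections import defaultdict

# Hypothetical relational rows (table "orders"); not data from the thesis.
ORDERS = [
    {"id": 1, "customer": "alice", "amount": 30},
    {"id": 2, "customer": "bob", "amount": 20},
    {"id": 3, "customer": "alice", "amount": 50},
]


def remap_row(row, key_column="id", family="cf"):
    """Remap one relational row to (row_key, {"family:column": value})."""
    row_key = str(row[key_column])
    cells = {
        f"{family}:{name}": value
        for name, value in row.items()
        if name != key_column
    }
    return row_key, cells


def map_phase(row_key, cells):
    """Emit (group key, value) pairs, as a mapper would for GROUP BY customer."""
    yield cells["cf:customer"], cells["cf:amount"]


def reduce_phase(group_key, values):
    """Aggregate all values that share one group key (SUM)."""
    return group_key, sum(values)


def run_group_by_sum(rows):
    """Simulate: SELECT customer, SUM(amount) FROM orders GROUP BY customer."""
    store = dict(remap_row(r) for r in rows)  # stand-in for an HBase table
    shuffled = defaultdict(list)              # stand-in for the shuffle step
    for row_key, cells in store.items():
        for key, value in map_phase(row_key, cells):
            shuffled[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in sorted(shuffled.items()))


RESULT = run_group_by_sum(ORDERS)
```

In this sketch the primary key becomes the row key and each remaining column becomes a `family:column` cell; the grouping column plays the role of the intermediate key that a real MapReduce job would shuffle on.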

Parallel Keywords

Cloud computing ； Structured data processing ； MapReduce ； NoSQL database

