Scheduling-Aware Data Prefetching based on Spark Framework

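In-memory techniques place frequently accessed data on faster, more expensive storage devices to provide better performance during data processing. Data prefetching aims to meet performance and cost requirements by moving data between different types of storage devices. However, existing techniques do not consider the following two situations. The first is how to optimize applications that do not access the same data multiple times. The second is how to release memory resources without affecting running applications. In this thesis, we propose Scheduling-Aware Data Prefetching based on Spark Framework (SADP), which comprises a data prefetching mechanism and a resource reclamation mechanism. SADP prefetches data that is about to be used into memory and releases memory resources so that other data can use them. Finally, experimental results on a real testbed verify the feasibility of SADP.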

Keywords

In-memory techniques; Spark; Alluxio; data prefetching

Parallel Abstract

In-memory techniques keep frequently used data in faster and more expensive storage media to improve the performance of data processing. Data prefetching aims to move data between different storage media to meet performance and cost requirements. However, existing methods do not consider the following two problems. The first is how to benefit data processing applications that do not frequently read the same data sets. The second is how to reclaim memory resources without affecting other running applications. In this paper, we propose Scheduling-Aware Data Prefetching based on Spark Framework (SADP), which includes data prefetching and data eviction mechanisms. SADP caches the data that will be used in the near future and evicts data from memory to release resources for hosting other data blocks. Finally, real-testbed experiments are performed to show the effectiveness of the proposed SADP.

Parallel Keywords

In-memory techniques; Spark; Alluxio; data prefetching
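
The abstract describes prefetching data that is about to be used and evicting data to free memory. The following is only a toy sketch of that general idea, not the SADP algorithm itself, which this page does not specify. It assumes the task order is known in advance, that memory holds a fixed number of equally sized blocks, and that a block may be evicted once no task in a short lookahead window needs it. The function `plan_prefetch` and its parameters `schedule`, `capacity`, and `lookahead` are hypothetical names introduced for illustration.

```python
def plan_prefetch(schedule, capacity, lookahead=1):
    """Return a log of prefetch/evict actions for a known task order.

    schedule  -- list of (task_name, [block_ids]) in execution order (assumed known)
    capacity  -- number of blocks that fit in memory (assumed uniform block size)
    lookahead -- how many upcoming tasks to prefetch for (illustrative parameter)
    """
    memory = set()
    log = []
    for i, (task, _) in enumerate(schedule):
        # Blocks needed by the current task and the next `lookahead` tasks,
        # kept in order of first use so earlier needs are served first.
        wanted = []
        for _, blocks in schedule[i:i + 1 + lookahead]:
            for block in blocks:
                if block not in wanted:
                    wanted.append(block)

        # Evict blocks that no task in the window needs.
        for block in sorted(memory):
            if block not in wanted:
                memory.discard(block)
                log.append(("evict", task, block))

        # Prefetch missing blocks while there is free capacity.
        for block in wanted:
            if block not in memory and len(memory) < capacity:
                memory.add(block)
                log.append(("prefetch", task, block))
    return log


if __name__ == "__main__":
    demo = [("t1", ["a", "b"]), ("t2", ["c"]), ("t3", ["a"])]
    for action in plan_prefetch(demo, capacity=3):
        print(action)
```

In this sketch, block `b` is evicted as soon as task `t2` starts because no task in the window reads it again, while `a` stays in memory for `t3`. A real system would also need to account for block sizes, transfer times, and concurrent applications.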

