基於 Spark 框架的排程感知資料預取機制

Translated Titles

Scheduling-Aware Data Prefetching Based on Spark Framework





Key Words

In-memory techniques ; Spark ; Alluxio ; data prefetching



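Chinese Abstract (English translation)

In-memory techniques place frequently accessed data on faster, more expensive storage devices to deliver better performance during data processing. Data prefetching aims to meet performance and cost requirements by moving data between different types of storage devices. Existing techniques, however, do not consider the following two situations. The first is how to optimize applications that do not access the same data multiple times. The second is how to release memory resources without affecting running applications. In this thesis, we propose Scheduling-Aware Data Prefetching based on Spark Framework (SADP), which comprises data prefetching and resource reclamation mechanisms. SADP prefetches data that is about to be used into memory, and also releases memory resources so that other data can use them. Finally, experimental data from a real testbed verifies the feasibility of SADP.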

English Abstract

In-memory techniques keep frequently used data in faster, more expensive storage media to improve the performance of data processing. Data prefetching aims to move data between different storage media to meet performance and cost requirements. However, existing methods do not consider the following two problems. The first is how to benefit data processing applications that do not repeatedly read the same data sets. The second is how to reclaim memory resources without affecting other running applications. In this thesis, we propose Scheduling-Aware Data Prefetching based on Spark Framework (SADP), which includes data prefetching and data eviction mechanisms. SADP caches the data that will be used in the near future and evicts data from memory to release resources for hosting other data blocks. Finally, experiments on a real testbed show the effectiveness of the proposed SADP.

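Topic Category College of Electrical Engineering and Computer Science > Graduate Institute of Electrical Engineering
Engineering > Electrical Engineering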