A Study on Evaluating and Improving Job Execution and Data Transmission Performance in Data Grids, YouTube, and Hadoop YARN

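In recent years, many distributed computing and storage systems, such as data grids, YouTube, and Hadoop YARN, have been proposed and used, respectively, to solve complex scientific computation and storage problems, to share videos, and to process massive data and many kinds of applications. Bandwidth consumption and job execution performance are therefore major issues in these systems. In data grids, many researchers have proposed data replication algorithms to shorten file transfer time and thereby improve data access performance and reduce bandwidth consumption. However, none of the existing data replication algorithms considers users' data access patterns, so they cannot effectively improve file transfer time, data access performance, or bandwidth consumption. YouTube uses a distributed memory cache (Memcached) to speed up video search and access. When the Memcached storage space is full, YouTube applies the least-recently-used (LRU) cache replacement algorithm, which replaces the video that has gone unaccessed for the longest time. However, using LRU for video replacement may increase network bandwidth consumption and lengthen video retrieval time. Meanwhile, the new-generation Hadoop YARN system provides several scheduling algorithms for scheduling many different kinds of applications, with the aim of high resource utilization and fair resource sharing. However, the performance of these scheduling algorithms when running mixes of different applications under different queue structures has not been evaluated in detail. To address these problems, this doctoral dissertation first proposes a data replication algorithm called the Popular File Replication First data replication algorithm (PFRF for short), which takes users' data access patterns into account to improve file transfer time, file availability, and bandwidth consumption in data grids. For YouTube, we propose two cache replacement algorithms based on the Pareto principle to reduce the number of video replacement faults and shorten video retrieval time, thereby improving video access performance and reducing bandwidth consumption. For Hadoop YARN, we classify the main characteristics of the scheduling algorithms it supports into four scheduling-policy combinations and also classify the applications that can run on Hadoop YARN. Finally, we evaluate the performance of these four scheduling-policy combinations when running various application mixes under different queue structures.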

Keywords

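data grid; Hadoop YARN; YouTube; file access behavior; distributed memory cache; Pareto principle; least-recently-used cache replacement algorithm; scheduling policy; performance evaluation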

Parallel Abstract

Recently, several types of distributed computation and storage systems, such as data grids, YouTube, and Hadoop YARN, have been widely employed to, respectively, solve complex scientific computation and storage problems, enable people to share videos, and process large-scale data and applications. In these systems, bandwidth consumption and job execution performance are two very important issues. In data grids, several data replication algorithms have been proposed to shorten file transmission time, improve data access performance, and reduce bandwidth consumption. However, none of them considers data access patterns, i.e., users' access behaviors, so data grids still suffer from long data transmission delays and high bandwidth consumption. YouTube uses a distributed memory caching scheme named Memcached to cache videos and employs the least-recently-used (LRU) cache replacement algorithm to evict videos when Memcached runs out of space. However, LRU might increase network overhead and video retrieval time. Hadoop YARN, for its part, provides several scheduling policies and supports queue hierarchies, but their impact on the different types of applications that can run on Hadoop YARN is unknown. To solve these problems, in this dissertation we propose the Popular File Replicate First (PFRF) algorithm, which considers user access behavior to improve job turnaround time, data availability, and bandwidth cost in data grids. Next, we propose two Pareto-based algorithms for YouTube that reduce video faults and shorten video retrieval time, thereby reducing network overhead. Finally, we study how the scheduling-policy combinations (SPCs) supported by Hadoop YARN, under several different queue structures, affect the performance of various types of applications.
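To make the baseline concrete, the LRU eviction described in the abstract can be sketched roughly as follows. This is only an illustration of plain LRU, not the Pareto-based algorithms proposed in the dissertation and not Memcached's actual implementation; the class name `LRUCache`, the `misses` counter, and the video keys are hypothetical names chosen for this example.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal LRU cache sketch (hypothetical; not Memcached's real code)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last
        self.misses = 0              # lookups that did not find the key

    def get(self, key):
        """Return the cached value, or None on a miss."""
        if key not in self._items:
            self.misses += 1
            return None
        # A hit makes this entry the most recently used one.
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        """Insert or update an entry, evicting the LRU entry if full."""
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict the entry that has gone unaccessed the longest.
            self._items.popitem(last=False)


if __name__ == "__main__":
    cache = LRUCache(capacity=2)
    cache.put("a", "video-a")
    cache.put("b", "video-b")
    cache.get("a")             # "a" becomes the most recently used entry
    cache.put("c", "video-c")  # cache is full, so "b" is evicted
    print(cache.get("b"))      # miss: "b" was evicted
    print(cache.misses)
```

In this sketch, recency is the only signal used for eviction: a video that is popular overall but was not touched very recently can still be evicted, and every later request for it counts as a miss that has to be served from slower storage.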

Parallel Keywords

data grid; Hadoop YARN; YouTube; data access patterns; Memcached; Pareto principle; least-recently-used replacement algorithm; scheduling policy; performance evaluation
