A Study on Improving the Fault-Tolerance Performance of the Hadoop Cloud Computing System with Checkpoints

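The MapReduce computing model for large-scale, data-intensive cloud computing has become increasingly popular in recent years. Hadoop is an open-source cloud platform that implements MapReduce, and it can be used to build a massive cluster of commodity machines easily and quickly. In such a large cluster, task failures and compute-node failures are not unusual, yet they have a very significant impact on Hadoop's performance. Although Hadoop automatically restarts failed tasks and compensates for slow tasks through Speculative Execution, many researchers have identified shortcomings in its fault tolerance. In this study, we investigate how strengthening the system's fault tolerance can reduce the longer completion time and the overall performance degradation caused by failure recovery while Hadoop runs MapReduce computations. We attempt to address this problem by designing a simple checkpointing mechanism for Map tasks. When the mechanism is enabled, the progress of processing an input data block is obtained from the Progress and Heartbeat information reported by the cloud system; when that progress reaches a certain percentage, the Mapper creates a checkpoint that stores the intermediate data it has produced so far. Once a checkpoint exists, a Mapper that fails can resume from the progress recorded in the checkpoint instead of restarting the task from the beginning. To speed up failure recovery when a compute node fails, we move the TaskTracker so that the compute node has Data locality, which saves the time needed to transfer input data blocks. If all nodes holding the input data block are busy, we give priority to nodes with Rack locality for replicating and moving the task, which also speeds up recovery. Through extensive simulations, we find that although the proposed method costs more storage space and network traffic, it outperforms native Hadoop in terms of job completion time.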
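The node-selection preference described above for recovery (data-local first, then rack-local) can be illustrated with a small sketch. This is not the thesis implementation and it does not model moving a TaskTracker; the `Node` type, the `pick_recovery_node` function, and the final fallback to any idle node are assumptions introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a compute node."""
    name: str
    rack: str
    busy: bool = False


def pick_recovery_node(nodes, replica_holders, failed):
    """Choose an idle node on which to re-run a task from a failed node.

    nodes           -- list of Node objects in the cluster
    replica_holders -- names of nodes that store the input data block
    failed          -- name of the node that failed
    Returns the preferred Node, or None if every surviving node is busy.
    """
    candidates = [n for n in nodes if not n.busy and n.name != failed]

    # 1. Data locality: an idle node that already holds the input block.
    for n in candidates:
        if n.name in replica_holders:
            return n

    # 2. Rack locality: an idle node in the same rack as a surviving holder.
    holder_racks = {
        n.rack for n in nodes
        if n.name in replica_holders and n.name != failed
    }
    for n in candidates:
        if n.rack in holder_racks:
            return n

    # 3. Assumed fallback (not stated in the abstract): any idle node.
    return candidates[0] if candidates else None
```

With replicas on `n1` and `n2`, `n1` failed, and `n2` busy, the sketch skips the data-local step and picks an idle node in `n2`'s rack, so the input block only has to cross an intra-rack link.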

Keywords

Hadoop; MapReduce; checkpoint; intermediate data; data locality

Parallel Abstract

The MapReduce computing paradigm has become extremely popular for large-scale, data-intensive applications in recent years. Hadoop, an open-source implementation of MapReduce, can be set up easily and rapidly on commodity hardware to form a massive computing cluster. In such a cluster, task failures and node failures are not anomalies, yet they can substantially degrade Hadoop's performance. Although Hadoop can restart failed tasks automatically and compensate for slow tasks through speculative execution, many researchers have identified shortcomings in Hadoop's fault tolerance. In this research, we try to address these shortcomings by designing a simple checkpointing mechanism for Map tasks. When the mechanism is enabled, a checkpoint is created for a mapper once its progress in processing the input data block reaches a certain percentage. If a mapper fails after a checkpoint has been created, it can resume from the checkpointed state without having to restart from scratch. In extensive simulations, the proposed approach shows better performance than native Hadoop in terms of job completion time, at the cost of more storage space and network traffic.
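The checkpoint-and-resume idea in the abstract can be sketched in a few lines. This is a minimal single-process illustration, not Hadoop code and not the thesis implementation: the `Checkpoint` record, the `run_mapper` function, the 25% step, and the `fail_at` failure-injection parameter are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical checkpoint: records processed so far and their output."""
    records_done: int = 0
    intermediate: list = field(default_factory=list)


def run_mapper(records, map_fn, checkpoint, step_pct=25, fail_at=None):
    """Run map_fn over records, resuming from the given checkpoint.

    A checkpoint is saved each time progress crosses a multiple of
    step_pct percent. fail_at (hypothetical) raises an error just
    before the record with that index is processed.
    """
    total = len(records)
    if total == 0:
        return []

    # Restore the intermediate data and position saved in the checkpoint.
    output = list(checkpoint.intermediate)
    start = checkpoint.records_done
    next_mark = (start * 100 // total // step_pct + 1) * step_pct

    for i in range(start, total):
        if fail_at is not None and i == fail_at:
            raise RuntimeError(f"injected mapper failure at record {i}")
        output.extend(map_fn(records[i]))

        progress = (i + 1) * 100 // total  # integer percent complete
        if progress >= next_mark:
            # Persist progress and a copy of the intermediate data.
            checkpoint.records_done = i + 1
            checkpoint.intermediate = list(output)
            next_mark = (progress // step_pct + 1) * step_pct

    return output
```

Calling `run_mapper` again with the same `Checkpoint` object after a failure re-processes only the records that follow the last saved checkpoint. The copy of the intermediate data kept in each checkpoint is the extra storage cost the abstract mentions.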

Parallel Keywords

Hadoop; MapReduce; Checkpoint; Intermediate Data; Data Locality

