
以非阻礙式訊息紀錄協定實作MPI-Based容錯中介軟體

Implementation of MPI-Based Fault Tolerant Middleware with Non-Blocking Message Logging Protocol

Advisor: 郭斯彥

Abstract


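In recent years, parallel computing has become one of the main ways to raise computing performance. High-performance parallel computers can be applied in fields such as commerce, national defense, and science. In science, high-performance computing is a great help to numerical simulation, which is in turn an important method for advancing contemporary science.

Many people have begun to research and develop distributed systems for carrying out parallel computing. Designing a distributed system is complex and difficult. Among the many properties that deserve careful planning and design, fault tolerance is an important goal. Every computer in a distributed system can fail, and fault tolerance is the ability to handle the failures that occur within the system. How to keep a running system unaffected by failures is a topic worth studying in fault tolerance techniques.

Fault tolerance approaches basically fall into two kinds, checkpointing and message logging, and each has given rise to algorithms of different forms. So far, however, no single algorithm is generally recognized as the most efficient. Different algorithms have to be chosen for different environments or situations to obtain the best efficiency.

The goal of this thesis is to analyze the differences among fault tolerance approaches on today's MPI-based distributed systems. We implement a fault-tolerant middleware for the MPI environment based on non-blocking message logging, measure its performance, and share the implementation experience.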

Parallel Abstract


In recent years, parallel computing has become one of the main ways to increase computing performance. High-performance parallel computers are applied in commerce, defense, and science; in science, high-performance computing benefits numerical simulation, a major driver of progress in contemporary science. Many people have begun to research and develop distributed systems that perform parallel computing. Designing a distributed system is complicated and difficult. Among the many characteristics that deserve careful design, fault tolerance is an important goal. Every computer in a distributed system may fail, and fault tolerance is the capability to deal with such failures. How to keep a running system unaffected by failures is therefore an important research topic in fault tolerance. Rollback-recovery methods are divided into checkpointing and message logging, and each has given rise to different algorithms. So far, no algorithm is generally recognized as the most efficient, so a different algorithm has to be chosen for different environments or circumstances to obtain the best efficiency. The goal of this thesis is to analyze the differences among fault tolerance methods in MPI-based distributed systems. We implement an MPI-based fault-tolerant middleware with a non-blocking message logging protocol, measure its performance, and share practical implementation experience.
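
The rollback-recovery idea named in the abstract (a checkpoint combined with a message log) can be illustrated with a minimal sketch. This is not the thesis's middleware: it is a single-process Python simulation that uses no MPI, and the names `StableStorage` and `Process` are hypothetical helpers introduced only for illustration. Logging here is synchronous, so the non-blocking aspect of the protocol is not modeled. The sketch assumes message delivery is the only source of nondeterminism, so replaying the logged messages in order rebuilds the pre-failure state.

```python
from dataclasses import dataclass, field


@dataclass
class StableStorage:
    """Hypothetical stand-in for storage that survives a process crash."""
    checkpoint: int = 0                        # state saved at the last checkpoint
    log: list = field(default_factory=list)    # messages delivered since then


class Process:
    """Toy process whose state depends on the order of delivered messages."""

    def __init__(self, storage: StableStorage) -> None:
        self.storage = storage
        self.state = storage.checkpoint

    def _apply(self, msg: int) -> None:
        # Order-sensitive update, so replay must preserve delivery order.
        self.state = self.state * 2 + msg

    def deliver(self, msg: int) -> None:
        # Log the message before applying it (assumption: synchronous logging).
        self.storage.log.append(msg)
        self._apply(msg)

    def take_checkpoint(self) -> None:
        # Save the state; messages logged before this point are no longer needed.
        self.storage.checkpoint = self.state
        self.storage.log.clear()

    @classmethod
    def recover(cls, storage: StableStorage) -> "Process":
        # Restart from the checkpoint, then replay logged messages in order.
        proc = cls(storage)
        for msg in storage.log:
            proc._apply(msg)
        return proc


if __name__ == "__main__":
    storage = StableStorage()
    p = Process(storage)
    p.deliver(1)
    p.deliver(2)
    p.take_checkpoint()
    p.deliver(3)
    p.deliver(4)
    before_crash = p.state
    del p                                  # simulate losing the volatile state
    q = Process.recover(storage)
    print(before_crash == q.state)
```

Checkpointing alone would restart the process from the saved state and lose the two later messages; the log is what lets recovery reach the state held just before the failure.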

Parallel Keywords

parallel computing; fault tolerance; MPI; checkpoint; message log

