
容錯及錯誤回復叢集系統在科學計算之實作

Implementation of a Fault Tolerant Cluster with Error Recovery for Scientific Computation

Advisor: Sy-Yen Kuo (郭斯彥)

Abstract


Parallel computing has recently become one of the main techniques for enhancing computer performance. High-performance computers are applied in many fields, including commerce, national defense, and science. Numerical simulation is an important method that has helped modern science flourish. A simulation fails if it is interrupted while running, so fault tolerance is an important issue. Fault-tolerance techniques fall into two main categories: automatic and non-automatic. This thesis discusses the basic automatic fault-tolerance techniques applied to clusters: coordinated and uncoordinated checkpointing, and pessimistic and optimistic message logging. It then implements an automatic fault-tolerant cluster for a scientific computing environment using coordinated checkpointing. A storage backup strategy is also implemented, using a network file server backed by a redundant array of inexpensive disks (RAID) at level 5.
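As an illustration of the coordinated checkpointing idea named in the abstract, the following is a minimal single-machine sketch, not the implementation described in the thesis. The names `Worker`, `Cluster`, `coordinated_checkpoint`, `rollback`, and the `fail_at` parameter are hypothetical, and the "simulation" is a toy recurrence. It assumes all workers reach the same step before a snapshot is taken, so the saved states form one consistent global state that every worker can restart from.

```python
import copy


class Worker:
    """One simulated process holding part of the computation state."""

    def __init__(self, rank):
        self.rank = rank
        self.step = 0
        self.value = float(rank)

    def advance(self):
        # Stand-in for one iteration of a numerical simulation.
        self.value = self.value * 0.5 + 1.0
        self.step += 1


class Cluster:
    """Toy cluster that checkpoints all workers at the same step."""

    def __init__(self, n_workers):
        self.workers = [Worker(r) for r in range(n_workers)]
        self.checkpoint = None

    def coordinated_checkpoint(self):
        # All workers must be at the same step (a barrier in a real
        # system), so the snapshot is a consistent global state.
        assert len({w.step for w in self.workers}) == 1
        self.checkpoint = copy.deepcopy(self.workers)

    def rollback(self):
        # After a failure, every worker restarts from the same snapshot.
        self.workers = copy.deepcopy(self.checkpoint)

    def run(self, total_steps, interval, fail_at=None):
        self.coordinated_checkpoint()
        failed = False
        while self.workers[0].step < total_steps:
            for w in self.workers:
                w.advance()
            step = self.workers[0].step
            if fail_at is not None and step == fail_at and not failed:
                # Simulated fault: discard progress since the last snapshot.
                failed = True
                self.rollback()
                continue
            if step % interval == 0:
                self.coordinated_checkpoint()
        return [w.value for w in self.workers]
```

For example, `Cluster(4).run(10, 3, fail_at=5)` loses the work done in steps 4 and 5, restarts from the snapshot taken at step 3, and finishes with the same values as an uninterrupted `Cluster(4).run(10, 3)`.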
