Implementation of a Fault-Tolerant and Error-Recovery Cluster System for Scientific Computing

Parallel computing has become one of the main techniques for improving computer performance. High-performance computers are applied in many fields, including commerce, national defense, and science. Numerical simulation is an important method in modern science, but a simulation fails if a fault interrupts it while it is running, so fault tolerance is an important issue. Fault-tolerance techniques fall into two main categories: (1) automatic and (2) non-automatic. This work discusses the basic automatic fault-tolerance techniques applied to clusters: coordinated and uncoordinated checkpointing, and pessimistic and optimistic message logging. It then implements an automatic fault-tolerant cluster for a scientific computing environment using coordinated checkpointing, together with a storage backup strategy based on a network file server built on a level 5 redundant array of inexpensive disks (RAID 5).

Keywords

fault tolerant; cluster; checkpoint; message log; redundant array of inexpensive disks; network file server

