Adaptive Two-Level Blocking Coordinated Checkpointing for High Performance Cluster Computing Systems

Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing that uses two levels of checkpoints. The first level is local checkpointing: computing nodes save their checkpoints to local disk. If a transient failure occurs on a computing node, the process can recover from the local disk. The second level is global checkpointing: computing nodes send their checkpoints to highly reliable global stable storage. If a permanent failure occurs on a computing node, that node can no longer be used, and the process recovers from global storage on a new computing node. Local checkpoints are taken more frequently than global checkpoints. In addition, at the end of each local checkpointing interval, the system estimates the expected recovery time in the case of a permanent failure and adaptively either takes a global checkpoint or skips it. Experimental results show that the average execution time of the NAS-BT application is significantly reduced by the proposed method, with a maximum reduction in execution time of 38%.
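The adaptive decision taken at the end of each local checkpointing interval can be illustrated with a minimal sketch. The abstract does not state the actual decision rule, so everything below is an assumption made for illustration: the exponential failure model, the threshold comparison against the global checkpoint cost, and all function and parameter names (`expected_recovery_time`, `should_take_global`, `plan_checkpoints`) are hypothetical and are not the authors' method.

```python
import math


def expected_recovery_time(work_since_global, restore_cost):
    """Assumed recovery time after a permanent failure: reload the last
    global checkpoint on a new node, then redo the work done since it."""
    return restore_cost + work_since_global


def should_take_global(work_since_global, restore_cost, global_cost,
                       permanent_failure_rate, local_interval):
    """Hypothetical rule (not from the paper): take a global checkpoint when
    the expected loss from a permanent failure in the next local interval
    exceeds the cost of writing the global checkpoint."""
    # Assumption: permanent failures follow an exponential model.
    p_fail = 1.0 - math.exp(-permanent_failure_rate * local_interval)
    expected_loss = p_fail * expected_recovery_time(work_since_global,
                                                    restore_cost)
    return expected_loss > global_cost


def plan_checkpoints(n_intervals, local_interval, restore_cost, global_cost,
                     permanent_failure_rate):
    """Walk through local intervals and record which level is taken."""
    decisions = []
    work_since_global = 0.0
    for _ in range(n_intervals):
        # A local checkpoint is taken at the end of every interval.
        work_since_global += local_interval
        if should_take_global(work_since_global, restore_cost, global_cost,
                              permanent_failure_rate, local_interval):
            decisions.append("global")
            work_since_global = 0.0  # global checkpoint resets exposed work
        else:
            decisions.append("local")  # global checkpoint skipped
    return decisions


if __name__ == "__main__":
    print(plan_checkpoints(6, 10.0, 0.0, 2.0, 0.01))
```

Under these assumed parameters, global checkpoints become more frequent as the permanent failure rate rises and are skipped entirely when that rate is zero, which matches the qualitative behavior the abstract describes.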

Keywords

blocking coordinated checkpointing; transient failure; permanent failure; local checkpoint; global checkpoint; optimal interval
