Blocking coordinated checkpointing is a well-known method for achieving fault tolerance in cluster computing systems. In this work, we introduce a new approach to blocking coordinated checkpointing based on two-level checkpointing. The first level is local checkpointing, in which computing nodes save their checkpoints to local disk. If a transient failure occurs in a computing node, the process can recover from the local disk. The second level is global checkpointing, in which computing nodes send their checkpoints to highly reliable global stable storage. If a permanent failure occurs in a computing node, that node can no longer be used, and the process recovers from global storage on a new computing node. Local checkpoints are taken more frequently than global checkpoints. In addition, at the end of each local checkpointing interval, the system estimates the expected recovery time in the case of a permanent failure and adaptively decides whether to take a global checkpoint or skip it. Experimental results show that the proposed method significantly reduces the average execution time of the NAS-BT application, with a maximum reduction of 38%.
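The adaptive decision taken at the end of each local checkpointing interval can be illustrated with a minimal sketch. The abstract does not state the decision rule, so everything below is an assumption made for illustration: the expected recovery time is modeled as a fixed restore cost plus the work done since the last global checkpoint, and a global checkpoint is taken once that estimate exceeds a budget. The names `CheckpointPolicy`, `expected_recovery_time`, `plan_checkpoints`, and all parameters are hypothetical and are not taken from the proposed method.

```python
from dataclasses import dataclass


@dataclass
class CheckpointPolicy:
    """Hypothetical policy parameters (illustrative only, not from the paper)."""
    local_interval: float       # seconds of computation between local checkpoints
    global_restore_cost: float  # seconds to load a global checkpoint on a new node
    recovery_budget: float      # assumed threshold on expected recovery time (seconds)


def expected_recovery_time(policy: CheckpointPolicy, intervals_since_global: int) -> float:
    """Assumed model: restore cost plus recomputation of work not yet in global storage."""
    lost_work = intervals_since_global * policy.local_interval
    return policy.global_restore_cost + lost_work


def plan_checkpoints(policy: CheckpointPolicy, num_intervals: int) -> list[str]:
    """Decide, for each local interval, whether to also take a global checkpoint."""
    plan = []
    intervals_since_global = 0
    for _ in range(num_intervals):
        # A local checkpoint is always taken at the end of the interval.
        intervals_since_global += 1
        if expected_recovery_time(policy, intervals_since_global) > policy.recovery_budget:
            # Recovery after a permanent failure would be too costly: go global.
            plan.append("local+global")
            intervals_since_global = 0
        else:
            # Recovery is still cheap enough: skip the global checkpoint.
            plan.append("local")
    return plan


if __name__ == "__main__":
    policy = CheckpointPolicy(local_interval=60.0,
                              global_restore_cost=30.0,
                              recovery_budget=200.0)
    print(plan_checkpoints(policy, 6))
```

With these illustrative values, a global checkpoint is taken on every third local interval, since the estimate first exceeds the 200-second budget after three intervals of unsaved work (30 + 3 × 60 = 210 seconds). A real system would derive the estimate from measured checkpoint sizes, storage bandwidth, and failure rates rather than from fixed constants.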