A Study of Low-Overhead Incremental Checkpointing and Roll-Forward Recovery

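As the Windows operating system has come into widespread use and its applications have grown more complex, latent errors in applications can lead to failures. Keeping the system from suffering service interruption, data loss, or even a crash of the entire machine because of such failures matters for many computers that run critical programs or must operate for long periods, and whether they have sufficient fault tolerance has become an important research topic. The aim of this study is to improve the performance of applications running under Windows by combining incremental checkpointing with copy-on-write, an optimized checkpoint interval, and a roll-forward recovery strategy to build a transparent, low-overhead checkpointing and failure recovery system. Checkpoints are set periodically so that the correct state of a process during normal execution is saved to stable storage; when a failure occurs, the recovery strategy keeps the task running normally. Capturing too much state data or placing checkpoints arbitrarily degrades system performance and makes recovery expensive. To minimize the time overhead for a process to complete its task, checkpoint placement must weigh the various conditions and constraints: the checkpoint interval must be set sensibly, a suitable recovery strategy adopted, and the amount of stored checkpoint data directly reduced. Win32 API functions are used to capture process state information correctly, and incremental checkpointing stores only the pages that have been modified rather than the entire process address space, which reduces the amount of checkpoint data. This is combined with copy-on-write to lessen the impact of saving state on checkpointing, lowering the overhead of taking a checkpoint. Finally, the expected time when the system encounters a failure is analyzed probabilistically, and from it the minimum expected total overhead is derived to obtain the optimal checkpoint interval. The roll-forward recovery strategy restores a process from a failed state to normal operation, reduces the time wasted on rollback recovery, and shortens the total time needed to complete the task. Simulation results show that low-overhead incremental checkpointing with roll-forward recovery not only improves the fault tolerance of the system but also reduces the overhead added by checkpointing.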
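The idea of storing only modified pages can be illustrated with a minimal sketch. This is not the system described in the abstract, which captures process state through Win32 API functions. Here the process memory is assumed to be a plain byte buffer, the 4096-byte page size is an assumption, and changed pages are detected by comparing hashes rather than by operating-system page tracking. The class name `IncrementalCheckpointer` and its methods are hypothetical.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def split_pages(memory, page_size=PAGE_SIZE):
    """Split a byte buffer into consecutive fixed-size pages."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


class IncrementalCheckpointer:
    """Hypothetical helper that stores only pages changed since the last checkpoint."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self._hashes = {}      # page index -> digest recorded at the last checkpoint
        self.checkpoints = []  # one {page index: page bytes} delta per checkpoint

    def checkpoint(self, memory):
        """Record the pages whose content differs from the previous checkpoint."""
        delta = {}
        for index, page in enumerate(split_pages(memory, self.page_size)):
            digest = hashlib.sha256(page).digest()
            if self._hashes.get(index) != digest:
                delta[index] = bytes(page)  # private copy of the changed page
                self._hashes[index] = digest
        self.checkpoints.append(delta)
        return delta

    def restore(self, size):
        """Rebuild the latest saved state by replaying all deltas in order."""
        memory = bytearray(size)
        for delta in self.checkpoints:
            for index, page in delta.items():
                start = index * self.page_size
                memory[start:start + len(page)] = page
        return bytes(memory)
```

With a four-page buffer, the first checkpoint stores all four pages, while a checkpoint taken after a single byte is modified stores only the page containing that byte. The saving comes from the delta size, at the cost of replaying deltas during recovery.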

Keywords

incremental checkpointing; fault tolerance; optimal interval; roll-forward recovery; copy-on-write

Parallel Abstract

As Windows operating systems (OS) have become widely used and applications have grown more complex, latent errors in applications can cause failures. Designing a system so that software errors do not lead to service interruption, significant loss of computation, or even a crash of the entire system has become important, and fault tolerance is a major concern for long-running computers and critical applications. This study describes the implementation of a low-overhead checkpointing and roll-forward recovery scheme that combines incremental checkpointing with copy-on-write, an optimal checkpointing interval, and roll-forward recovery. Checkpoints save the process state periodically during failure-free execution, and the recovery scheme keeps the task executing normally when a failure occurs. Capturing an excessive amount of state or checkpointing arbitrarily results in performance degradation and expensive recovery. To minimize the overhead of checkpointing and recovery, process state is captured through Win32 API interception together with incremental checkpointing and copy-on-write. Instead of saving the entire process address space, the scheme saves only the changed pages and uses a buffer to hold state temporarily while checkpointing, which reduces the checkpointing overhead. The expected time when the system encounters a failure is calculated probabilistically, and the minimum expected total overhead of completing a task is derived from it to obtain the optimal checkpointing interval. The roll-forward recovery scheme returns the process to normal operation and reduces the total execution time of a task when a failure occurs. Simulation results show that the proposed low-overhead checkpointing and roll-forward recovery scheme not only enhances fault tolerance but also reduces the overhead of checkpointing.
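The trade-off behind an optimal checkpointing interval can be sketched numerically. The model below is a standard textbook simplification and not the derivation used in this study: it assumes exponentially distributed failures with a constant rate, a fixed checkpoint cost, a failure-free recovery cost, and restart of the failed segment (a rollback model, whereas the study uses roll-forward recovery). Under those assumptions the expected time to finish a segment of length `interval + checkpoint_cost` is `(1/failure_rate + recovery_cost) * (exp(failure_rate * (interval + checkpoint_cost)) - 1)`. All function names and parameter values are hypothetical.

```python
import math


def expected_segment_time(interval, checkpoint_cost, recovery_cost, failure_rate):
    """Expected time to finish one segment plus its checkpoint (assumed model)."""
    tau = interval + checkpoint_cost
    return (1.0 / failure_rate + recovery_cost) * math.expm1(failure_rate * tau)


def expected_total_time(work, interval, checkpoint_cost, recovery_cost, failure_rate):
    """Expected completion time when `work` is split into equal segments."""
    segments = work / interval
    return segments * expected_segment_time(
        interval, checkpoint_cost, recovery_cost, failure_rate
    )


def best_interval(work, checkpoint_cost, recovery_cost, failure_rate, candidates):
    """Pick the candidate interval with the smallest expected completion time."""
    return min(
        candidates,
        key=lambda t: expected_total_time(
            work, t, checkpoint_cost, recovery_cost, failure_rate
        ),
    )


def young_interval(checkpoint_cost, failure_rate):
    """Young's first-order approximation: sqrt(2 * checkpoint cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


# Hypothetical parameters: 10 s checkpoint, 30 s recovery, MTBF of 20000 s.
numeric = best_interval(1e6, 10.0, 30.0, 1.0 / 20000.0, range(1, 5001))
approximate = young_interval(10.0, 1.0 / 20000.0)
```

Short intervals spend most of the time on checkpoints, and long intervals lose more work per failure, so the expected completion time has a single minimum between the two extremes. For these parameters the scanned optimum lies close to Young's approximation.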

Parallel Keywords

fault tolerance; incremental checkpointing; copy-on-write; optimal interval; roll-forward recovery

