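Rapid advances in computer hardware and software in recent years have greatly increased the processing performance of personal computers, while the vigorous development of network technology has given users richer and more diverse services. These two factors have greatly improved the feasibility of using PC clusters for high-performance computing (HPC), and have given rise to a variety of cluster management software. In a distributed environment, where nodes are spread across different locations, errors are more likely to occur than when work runs on a single machine; a distributed environment therefore needs appropriate error-handling mechanisms to avoid errors, remove errors, or fail over. This study proposes a distributed parallel computing platform with a failover mechanism. Built on the widely popular Linux operating system, it combines several personal computers into a PC cluster that forms a powerful high-performance computing resource. Real-time load monitoring is used to dispatch jobs and to detect errors on nodes within the system; combined with a checkpoint library and an automatic error recovery mechanism, this achieves failover and improves system reliability. Finally, experiments evaluate the load-distribution benefit of the system and the overhead of checkpointing, and simulate failover scenarios to verify the feasibility of this work.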
Over the last few decades, the computing power of personal computers has grown steadily as PCs and their peripherals have made very significant progress; at the same time, computer network technologies have advanced faster than ever before. Faster processors and higher bandwidth, together with the availability of low-cost computing resources and high-speed computer networks, have made high-performance computing practical in real-world settings. In a distributed environment, the probability of a system error is higher than on a single computer, so error-handling methods such as error avoidance, error elimination, or failover are necessary. In this study, a distributed parallel processing platform with a failover mechanism is proposed. It is built on a PC cluster running the popular Linux operating system. The failover mechanism consists of error detection, checkpointing, and recovery: a monitor gathers real-time CPU load averages to detect errors, checkpointing saves a job's state, and an automatic recovery mechanism rolls the job back to its saved state. Together, these three methods reduce wasted resources and increase the system's reliability. Experimental results are promising and demonstrate the feasibility and usefulness of the proposed system.
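The three parts of the failover mechanism (error detection, checkpointing, and recovery) can be illustrated with a minimal sketch. This is not the implementation described in this study: the names `Monitor`, `save_checkpoint`, `load_checkpoint`, and `run_job`, the 5-second timeout, and the use of `pickle` for application-level checkpoints are all assumptions made for illustration. In particular, the sketch assumes a node is considered failed when its load report goes stale, which is only one plausible way to detect errors from load data.

```python
import os
import pickle
import tempfile

HEARTBEAT_TIMEOUT = 5.0  # seconds; hypothetical threshold, not from the study


class Monitor:
    """Keeps the latest load report per node (illustrative only)."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.reports = {}  # node name -> (timestamp, load average)

    def report(self, node, load, now):
        self.reports[node] = (now, load)

    def failed_nodes(self, now):
        # Assumption: a node whose last report is older than the timeout has failed.
        return [n for n, (t, _) in self.reports.items() if now - t > self.timeout]

    def least_loaded(self, now):
        # Pick the live node with the lowest reported load for dispatch.
        live = [n for n, (t, _) in self.reports.items() if now - t <= self.timeout]
        if not live:
            raise RuntimeError("no live nodes available")
        return min(live, key=lambda n: self.reports[n][1])


def save_checkpoint(path, state):
    # Write to a temporary file, then rename, so a crash never leaves a partial file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(state, steps, path, checkpoint_every=10):
    """Toy job: sums 0..steps-1, saving its state every `checkpoint_every` steps."""
    while state["i"] < steps:
        state["total"] += state["i"]
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

In this sketch, recovery amounts to calling `load_checkpoint` on the last saved file and passing the result back to `run_job` on the node returned by `least_loaded`; only the work done since the last checkpoint is repeated. A real checkpoint library would typically capture process state rather than a Python dictionary.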