A Preliminary Study of Fault-Tolerance Mechanisms in the Dynamic Distributed Mobile Computing Environment

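Fault tolerance is an indispensable mechanism in a distributed system: it keeps the failure of some nodes from bringing down the whole system. The Dynamic Distributed Mobile Computing Environment (D2MCE) is a computing environment built on Distributed Shared Memory (DSM), in which compute nodes can dynamically join a computation or leave it voluntarily. DSM is a logically global memory that lets every node access the same data. D2MCE adopts a relaxed memory consistency model, HERC, under which a node obtains the latest data through the home node when it needs it; this reduces unnecessary data transfer while keeping the latest data on at least one node. This paper presents an implementation of fault tolerance in D2MCE. Because compute nodes are independent devices, a network disconnection or device failure can prevent the system from operating normally. We add a backup mechanism for the home node to the HERC algorithm so that shared data is kept on at least two nodes, which lets the system tolerate the failure of one node without losing data. This paper also proposes a job scheduling mechanism. It divides the whole computation into several independent jobs; a job manager distributes the jobs to the nodes, and a job counts as successful only after the node that completed it has committed it to the job manager. This mechanism can tolerate the failure of any node other than the job manager.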
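
The home-node backup idea described above can be illustrated with a minimal Python sketch. It is not the D2MCE or HERC implementation: the names `Node`, `ReplicatedHome`, `write`, and `read` are hypothetical, failure detection is reduced to an `alive` flag, and the sketch assumes exactly one home node and one backup node per shared datum.

```python
class Node:
    """A compute node holding a local copy of some shared data."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}


class ReplicatedHome:
    """Hypothetical wrapper: one home node plus one backup node."""

    def __init__(self, home, backup):
        self.home = home
        self.backup = backup

    def write(self, key, value):
        # Update the home copy and mirror it to the backup, so the
        # latest value lives on two nodes.
        self.home.store[key] = value
        self.backup.store[key] = value

    def read(self, key):
        # Normal case: the home node serves the latest value.
        if self.home.alive:
            return self.home.store[key]
        # Home failed: promote the backup so it serves the data instead.
        if self.backup.alive:
            self.home, self.backup = self.backup, self.home
            return self.home.store[key]
        raise RuntimeError("home and backup both failed; data is lost")


a, b = Node("A"), Node("B")
shared = ReplicatedHome(home=a, backup=b)
shared.write("x", 42)
a.alive = False            # simulate a crash of the home node
value = shared.read("x")   # served by the promoted backup
```

After the promotion only one live copy remains, so a real system would presumably need to recruit a new backup to keep tolerating further failures; the sketch leaves that step out.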

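Keywords

Dynamic Distributed Mobile Computing Environment; Distributed Shared Memory system; fault-tolerance mechanism

Parallel Abstract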

Fault tolerance is an indispensable mechanism for distributed systems: it prevents the failure of some nodes from crashing the whole system. D2MCE (Dynamic Distributed Mobile Computing Environment) constructs a computing environment using Distributed Shared Memory (DSM), in which compute nodes can join dynamically and leave voluntarily. DSM is a logically global memory that lets every node access the same shared data. D2MCE implements a relaxed memory consistency model called HERC, which lets each node obtain up-to-date data through the home node when it needs it. This reduces unnecessary data transmission while keeping the latest data on at least one node. This paper proposes a fault tolerance mechanism for D2MCE, whose compute nodes are independent devices that may suffer network disconnection or device malfunction and thereby keep the system from working properly. We add a backup mechanism for the home node to the HERC algorithm so that shared data is stored on at least two nodes. This enables D2MCE to tolerate a single node failure without losing shared data. This paper also proposes a job scheduling mechanism. The whole computation is divided into several independent jobs, and a job manager assigns them to nodes. After a node finishes a job, it must commit the job to the job manager before the job counts as complete. This mechanism can tolerate the failure of any node other than the job manager.
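
The job scheduling mechanism can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scheduler built for D2MCE: the class `JobManager` and its methods `assign`, `commit`, and `node_failed` are hypothetical names, nodes are identified by plain strings, and failure detection is assumed to happen elsewhere.

```python
from collections import deque


class JobManager:
    """Hypothetical job manager: hands out jobs and records commits."""

    def __init__(self, jobs):
        self.pending = deque(jobs)   # jobs not yet handed out
        self.assigned = {}           # job -> node currently working on it
        self.results = {}            # job -> committed result

    def assign(self, node):
        # Hand the next pending job to the node, or None if none is left.
        if not self.pending:
            return None
        job = self.pending.popleft()
        self.assigned[job] = node
        return job

    def commit(self, node, job, result):
        # A job succeeds only when its current owner commits it.
        if self.assigned.get(job) != node:
            return False
        del self.assigned[job]
        self.results[job] = result
        return True

    def node_failed(self, node):
        # Put every uncommitted job of the failed node back in the queue.
        lost = [j for j, n in self.assigned.items() if n == node]
        for job in lost:
            del self.assigned[job]
            self.pending.append(job)

    def done(self):
        return not self.pending and not self.assigned


manager = JobManager(range(4))
j0 = manager.assign("n1")
j1 = manager.assign("n2")
manager.node_failed("n2")            # n2 crashes before committing j1
manager.commit("n1", j0, j0 * j0)
while not manager.done():            # n1 picks up all remaining work
    job = manager.assign("n1")
    manager.commit("n1", job, job * job)
```

Because uncommitted jobs return to the queue, the loss of a worker costs only repeated work. The manager itself remains a single point of failure, which matches the limitation stated in the abstract.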

Parallel Keywords

D2MCE; DSM; Fault Tolerance

