
Performance Study on Replica Management for Hadoop Distributed File System

Advisor: 林其誼

Abstract


With the rapid development of network technology, the amount of data generated on the Internet has been growing exponentially, and efficient access to massive data has become an active research topic in computing. The arrival of the Big Data era brings both opportunities and challenges. Cloud storage provides an ideal storage solution for Big Data, and its availability and performance are important considerations for users. HDFS (Hadoop Distributed File System) is Hadoop's distributed file system, designed to be deployed on commodity hardware. It offers high reliability and efficient access because it supports file replication, which not only keeps files highly available but also improves the performance of the system as a whole.

File replication techniques can be divided into two types, static and dynamic. Compared with static replication, dynamic replication better meets the data-access demands of complex cloud-storage environments, so the dynamic adjustment of the replication factor and the replica placement problem are the focus of researchers in this area and the main subjects of this thesis. This study analyzes the shortcomings of the static replication mechanism in the existing Hadoop Distributed File System and adjusts the replication factor dynamically instead, aiming to improve access performance while avoiding wasted storage. Experimental results show that the improved replication-factor adjustment strategy reduces the average job response time of the system and thus effectively improves file-access performance.
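The abstract does not spell out the adjustment algorithm, so the following is only a minimal sketch of the general idea, assuming that observed access frequency drives the decision. It raises or lowers a file's replication factor through Hadoop's public FileSystem.setReplication API; the thresholds, the replica bounds, and the accessesPerHour input are hypothetical placeholders, not the thesis's actual strategy.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Minimal sketch of dynamic replication-factor adjustment, assuming replica
     * counts are driven by access frequency. Thresholds and bounds are assumed
     * values for illustration, not part of HDFS or of the thesis.
     */
    public class ReplicaFactorAdjuster {
        private static final short MIN_REPLICAS = 2;   // assumed floor, preserves availability
        private static final short MAX_REPLICAS = 5;   // assumed ceiling, limits storage cost
        private static final long HOT_THRESHOLD = 100; // assumed accesses/hour for a "hot" file
        private static final long COLD_THRESHOLD = 10; // assumed accesses/hour for a "cold" file

        /** Raise or lower one file's replication factor based on its access rate. */
        public static void adjust(FileSystem fs, Path file, long accessesPerHour) throws IOException {
            short current = fs.getFileStatus(file).getReplication();
            short target = current;
            if (accessesPerHour > HOT_THRESHOLD && current < MAX_REPLICAS) {
                target = (short) (current + 1);   // hot file: add a replica for read throughput
            } else if (accessesPerHour < COLD_THRESHOLD && current > MIN_REPLICAS) {
                target = (short) (current - 1);   // cold file: reclaim storage
            }
            if (target != current) {
                fs.setReplication(file, target);  // NameNode re-replicates or deletes asynchronously
            }
        }

        public static void main(String[] args) throws IOException {
            FileSystem fs = FileSystem.get(new Configuration());
            adjust(fs, new Path("/data/sample.txt"), 150); // example: a hot file gains one replica
        }
    }

For quick experiments the same change can be made from the shell with "hdfs dfs -setrep -w 4 /data/sample.txt", where -w waits until re-replication completes.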

In addition, this study addresses the replica placement problem. While following the basic placement principles of the native system, it places replicas according to a comprehensive performance evaluation value computed for each node. Experimental results show that, while preserving the overall availability of the system, the improved placement strategy makes the replica distribution more reasonable and balanced, improves the system's reliability and processing speed, and achieves better load balancing across the cluster.
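The evaluation formula itself is not given in the abstract, so the sketch below assumes a weighted sum of normalized resource utilizations (CPU, memory, disk, network). The NodeStats type, the weights, and the metric set are illustrative assumptions, not the thesis's actual metric.

    import java.util.Comparator;
    import java.util.List;

    /**
     * Minimal sketch of score-driven replica placement. Each candidate DataNode
     * gets a comprehensive score from assumed, normalized load metrics; the
     * highest-scoring nodes receive the replicas.
     */
    public class ScoredPlacement {
        /** Snapshot of one DataNode's load; every utilization lies in [0, 1]. */
        public record NodeStats(String host, double cpuUtil, double memUtil,
                                double diskUtil, double netUtil) {}

        // Assumed weights; a higher score means more spare capacity for a new replica.
        private static final double W_CPU = 0.3, W_MEM = 0.2, W_DISK = 0.3, W_NET = 0.2;

        /** Comprehensive performance score: weighted free capacity over four resources. */
        public static double score(NodeStats n) {
            return W_CPU * (1 - n.cpuUtil()) + W_MEM * (1 - n.memUtil())
                 + W_DISK * (1 - n.diskUtil()) + W_NET * (1 - n.netUtil());
        }

        /** Pick the k best-scoring candidates (rack-awareness constraints would be applied first). */
        public static List<NodeStats> choose(List<NodeStats> candidates, int k) {
            return candidates.stream()
                    .sorted(Comparator.comparingDouble(ScoredPlacement::score).reversed())
                    .limit(k)
                    .toList();
        }

        public static void main(String[] args) {
            List<NodeStats> nodes = List.of(
                    new NodeStats("dn1", 0.80, 0.60, 0.70, 0.50),
                    new NodeStats("dn2", 0.20, 0.30, 0.40, 0.10),
                    new NodeStats("dn3", 0.50, 0.50, 0.20, 0.30));
            choose(nodes, 2).forEach(n -> System.out.println(n.host() + " score=" + score(n)));
        }
    }

To plug such a policy into HDFS itself, the usual extension point is the abstract class org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy, selected through the dfs.block.replicator.classname configuration key.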

