Optimizing Read Operations of Hadoop Distributed File System on Heterogeneous Storages

A key challenge in big data processing frameworks built on the Hadoop distributed file system (HDFS) is optimizing the throughput of read operations. Toward this goal, several studies have sought to enhance read performance on heterogeneous storages. Although HDFS now supports several storage policies for placing data blocks on heterogeneous storages, it fails to fully exploit the potential of fast storages (e.g., SSD). The primary reason for this suboptimal read performance is that, when distributing read requests, existing HDFS considers only the network distance between the client and the datanodes, so more read requests are sent to slower storages that hold more data (e.g., HDD). In this paper, we propose a new data retrieval policy for distributing read requests on heterogeneous storages in HDFS. Specifically, the proposed policy considers both the characteristics of the storages in the datanodes and the network environment in order to distribute read requests efficiently. To balance these two factors, we develop several policies, including the proposed one: random selection, storage type selection, weighted round-robin selection, and dynamic round-robin selection. Our experimental results on extensive benchmark datasets show that the proposed method outperforms the existing policies in throughput by up to six times.

Keywords

Hadoop distributed file system; heterogeneous storage; data retrieval policy; MapReduce; load balancing
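
The abstract names weighted round-robin selection as one of the policies that balance storage type against network distance, but it gives no algorithmic details. The following is a minimal Python sketch of one plausible reading, not the authors' implementation. The `Replica` class, the `STORAGE_WEIGHT` values, and the `effective_weight` formula are all hypothetical and chosen only for illustration; the selection loop uses the well-known smooth weighted round-robin scheme.

```python
from dataclasses import dataclass

# Hypothetical relative read-speed weights per storage type.
# These values are illustrative assumptions, not figures from the paper.
STORAGE_WEIGHT = {"SSD": 4, "HDD": 1}


@dataclass
class Replica:
    datanode: str
    storage_type: str      # "SSD" or "HDD"
    network_distance: int  # assumed: 0 means the replica is local to the client
    current: float = 0.0   # running score used by smooth weighted round-robin


def effective_weight(replica: Replica) -> float:
    """Combine storage speed and network distance into one weight.

    Assumed formula: faster storage raises the weight,
    greater network distance lowers it.
    """
    return STORAGE_WEIGHT[replica.storage_type] / (1 + replica.network_distance)


def pick_replica(replicas: list[Replica]) -> Replica:
    """Choose the replica for the next read using smooth weighted round-robin.

    Each call adds every replica's weight to its running score, picks the
    replica with the highest score, and subtracts the total weight from the
    winner, so picks are spread in proportion to the weights.
    """
    total = 0.0
    for r in replicas:
        w = effective_weight(r)
        r.current += w
        total += w
    best = max(replicas, key=lambda r: r.current)
    best.current -= total
    return best


if __name__ == "__main__":
    replicas = [Replica("dn1", "SSD", 0), Replica("dn2", "HDD", 0)]
    for _ in range(5):
        print(pick_replica(replicas).datanode)
```

With the assumed weights and two equally distant replicas, five consecutive reads send four requests to the SSD replica and one to the HDD replica. A policy that considers network distance alone would treat the two replicas as equivalent.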
