Optimizing data placement in heterogeneous Hadoop clusters

Publisher

SPRINGER
DOI: 10.1007/s10586-015-0495-z

Keywords

Hadoop cluster; HDFS; Data placement; Heterogeneous; Replica

Funding

  1. National Natural Science Foundation of China [61320106007, 61202449, 61572129, 61502097, 61370207]
  2. National High-tech R&D Program of China (863 Program) [2013AA013503]
  3. China Fundamental Research Funds for the Central Universities [1109007115]
  4. Jiangsu research prospective joint research project [BY2012202, BY2013073-01]
  5. Jiangsu Provincial Key Laboratory of Network and Information Security [BM2003201]
  6. Key Laboratory of Computer Network and Information Integration of Ministry of Education of China [93K-9]
  7. Collaborative Innovation Center of Novel Software Technology and Industrialization

Abstract

Data placement decisions in the Hadoop distributed file system (HDFS) are critical to data locality, which is a primary criterion for MapReduce task scheduling and ultimately affects application performance. HDFS's existing rack-aware data placement strategy and replication scheme work well with the MapReduce framework in homogeneous Hadoop clusters, but in practice this placement policy can noticeably reduce MapReduce performance and may increase energy dissipation in heterogeneous environments. In addition, HDFS applies a fixed default replication factor to every data block, which wastes storage space when the Hadoop system holds a large amount of inactive data. In this paper, we propose a novel data placement strategy (SLDP) for heterogeneous Hadoop clusters. SLDP first uses a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST in a circular fashion according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to save disk space and also provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient and space-saving, and that it significantly improves MapReduce performance in a heterogeneous Hadoop cluster.
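
The abstract does not give SLDP's actual algorithms, so the following Python sketch only illustrates the general idea of tiering nodes, scaling replica counts with hotness, and cycling through the nodes of a tier. Everything in it is an assumption made for illustration: the per-node capability score, the equal-size tier split, the linear hotness-to-replica mapping, the hotness scale of 0.0 to 1.0, and all function names. It is not the method from the paper and does not use any Hadoop API; the power control function is not modeled.

```python
from itertools import cycle
from typing import Dict, List


def clamp01(x: float) -> float:
    """Clamp a hotness value into the assumed range [0.0, 1.0]."""
    return max(0.0, min(1.0, x))


def build_tiers(node_scores: Dict[str, float], num_tiers: int) -> List[List[str]]:
    """Rank nodes by a hypothetical capability score and cut the ranking
    into roughly equal tiers, fastest tier first."""
    ranked = sorted(node_scores, key=lambda n: (-node_scores[n], n))
    size = max(1, -(-len(ranked) // num_tiers))  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def replica_count(hotness: float, min_replicas: int = 1, max_replicas: int = 3) -> int:
    """Assumed linear mapping: hotter blocks get more replicas."""
    return min_replicas + round(clamp01(hotness) * (max_replicas - min_replicas))


def place_blocks(block_hotness: Dict[str, float],
                 tiers: List[List[str]]) -> Dict[str, List[str]]:
    """Send hot blocks to fast tiers and cycle through the nodes of each tier."""
    cursors = [cycle(tier) for tier in tiers]  # one circular cursor per tier
    placement: Dict[str, List[str]] = {}
    ordered = sorted(block_hotness.items(), key=lambda kv: (-kv[1], kv[0]))
    for block, hotness in ordered:
        h = clamp01(hotness)
        tier_idx = min(int((1.0 - h) * len(tiers)), len(tiers) - 1)
        # Never ask for more replicas than the tier has distinct nodes.
        wanted = min(replica_count(h), len(tiers[tier_idx]))
        placement[block] = [next(cursors[tier_idx]) for _ in range(wanted)]
    return placement


if __name__ == "__main__":
    scores = {"n1": 4.0, "n2": 3.0, "n3": 2.0, "n4": 1.0}  # made-up values
    tiers = build_tiers(scores, num_tiers=2)
    print(tiers)
    print(place_blocks({"hot": 1.0, "cold": 0.0}, tiers))
```

In this sketch a cold block keeps a single replica on a slow tier, which is one way the storage saving described above could arise; a real system would also have to respect rack placement and fault-tolerance constraints that are ignored here.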
