Optimizing data placement in heterogeneous Hadoop clusters

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS (2015)

Journal

CLUSTER COMPUTING-THE JOURNAL OF NETWORKS SOFTWARE TOOLS AND APPLICATIONS

Volume 18, Issue 4, Pages 1465-1480

Publisher

SPRINGER

DOI: 10.1007/s10586-015-0495-z

Keywords

Hadoop cluster; HDFS; Data placement; Heterogeneous; Replica

Funding

National Natural Science Foundation of China [61320106007, 61202449, 61572129, 61502097, 61370207]
National High-tech R&D Program of China (863 Program) [2013AA013503]
China Fundamental Research Funds for the Central Universities [1109007115]
Jiangsu research prospective joint research project [BY2012202, BY2013073-01]
Jiangsu Provincial Key Laboratory of Network and Information Security [BM2003201]
Key Laboratory of Computer Network and Information Integration of Ministry of Education of China [93K-9]
Collaborative Innovation Center of Novel Software Technology and Industrialization


Abstract

Data placement in the Hadoop distributed file system (HDFS) largely determines data locality, which is a primary criterion for task scheduling in the MapReduce model and ultimately affects application performance. HDFS's existing rack-aware data placement strategy and replication scheme work well with the MapReduce framework in homogeneous Hadoop clusters, but in heterogeneous environments this placement policy can noticeably reduce MapReduce performance and may increase energy dissipation. In addition, HDFS applies a fixed default replication factor to every data block, which wastes storage space when the Hadoop system holds a large amount of inactive data. In this paper, we propose a novel data placement strategy (SLDP) for heterogeneous Hadoop clusters. SLDP first uses a heterogeneity-aware algorithm to divide the nodes into several virtual storage tiers (VSTs), and then places data blocks across the nodes in each VST circuitously according to the hotness of the data. Furthermore, SLDP uses hotness-proportional replication to save disk space and also provides an effective power control function. Experimental results on two real data-intensive applications show that SLDP is energy-efficient and space-saving, and that it significantly improves MapReduce performance in a heterogeneous Hadoop cluster.
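The abstract does not give SLDP's actual algorithm, so the following is only a minimal illustrative sketch of the three ideas it names: tiering nodes by capability, placing hotter blocks on faster tiers, and scaling replica counts with hotness. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Node` class, the `capability` score, the equal-size tier split, the round-robin reading of "circuitously", and the linear replica formula with bounds of 1 and 3.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Node:
    name: str
    capability: float  # hypothetical score, e.g. from a benchmark run


def build_tiers(nodes, num_tiers):
    """Sort nodes fastest-first and cut them into roughly equal tiers (assumed rule)."""
    ranked = sorted(nodes, key=lambda n: n.capability, reverse=True)
    size = max(1, -(-len(ranked) // num_tiers))  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]


def replica_count(hotness, max_hotness, min_replicas=1, max_replicas=3):
    """Scale replicas linearly with hotness, clamped to the bounds (assumed formula)."""
    if max_hotness <= 0:
        return min_replicas
    scaled = round(max_replicas * hotness / max_hotness)
    return max(min_replicas, min(max_replicas, scaled))


def place_blocks(block_hotness, tiers):
    """Send hotter blocks to faster tiers and rotate through the nodes of each tier."""
    max_hot = max(block_hotness.values(), default=0)
    cursors = [cycle(tier) for tier in tiers]  # one round-robin cursor per tier
    ranked = sorted(block_hotness, key=block_hotness.get, reverse=True)
    placement = {}
    for rank, block in enumerate(ranked):
        tier_idx = rank * len(tiers) // len(ranked)
        # Never ask for more replicas than the tier has nodes.
        copies = min(replica_count(block_hotness[block], max_hot), len(tiers[tier_idx]))
        placement[block] = [next(cursors[tier_idx]).name for _ in range(copies)]
    return placement


nodes = [Node("n1", 4.0), Node("n2", 3.0), Node("n3", 2.0), Node("n4", 1.0)]
tiers = build_tiers(nodes, num_tiers=2)
placement = place_blocks({"a": 100, "b": 60, "c": 10, "d": 1}, tiers)
```

In this toy run the two hottest blocks land on the faster tier with two replicas each, while the two cold blocks get a single replica on the slower tier, which is where the disk-space saving would come from. The paper's power control function is not modeled here.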
