Supporting HPC Analytics Applications with Access Patterns Using Data Restructuring and Data-Centric Scheduling Techniques in MapReduce

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS (2013)

Journal

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

Volume 24, Issue 1, Pages 158-169

Publisher

IEEE COMPUTER SOC

DOI: 10.1109/TPDS.2012.88

Keywords

HPC analytics framework; data-intensive systems; MapReduce; scheduling

Funding

US National Science Foundation (NSF) [CCF-0811413, CNS-1115665]
National Science Foundation [0953946]
NSF Directorate for Computer and Information Science and Engineering, Division of Computing and Communication Foundations [0953946]
NSF Directorate for Computer and Information Science and Engineering, Division of Computing and Communication Foundations [0811413]
NSF Directorate for Computer and Information Science and Engineering, Division of Computer and Network Systems [1115665]

Abstract

High Performance Computing (HPC) applications have seen explosive growth in data size in recent years. Many application scientists have initiated efforts to integrate data-intensive computing into compute-intensive HPC facilities, particularly for data analytics. We have observed several scientific applications that must migrate their data from an HPC storage system to a data-intensive one for analytics. There is a gap between the data semantics of HPC storage and those of data-intensive systems; hence, once migrated, the data must be further refined and reorganized. This reorganization must be performed before existing data-intensive tools such as MapReduce can be used to analyze the data. It requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application in the form of excessive I/O operations: for every MapReduce phase, a distributed read and write operation on the file system must be performed. Our contribution is a MapReduce-based framework for HPC analytics that eliminates the multiple scans and reduces the number of data preprocessing MapReduce programs. We also implement a data-centric scheduler to further improve the performance of HPC analytics MapReduce programs by maintaining data locality. We have added expressiveness to the MapReduce language to allow application scientists to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data preprocessing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system.
Using our augmented MapReduce system, MapReduce with Access Patterns (MRAP), we have demonstrated up to 33 percent throughput improvement in one real application and up to 70 percent in an I/O kernel of another application. Our scheduling results show up to 49 percent improvement for an I/O kernel of a prevalent HPC analysis application.
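
The abstract does not show the MRAP interface, so the following is only an illustrative sketch of the two ideas it names: describing the logical layout of the data so it can be regrouped in a single pass, and placing tasks where their data already resides. Every name here (`StridedPattern`, `restructure`, `schedule`) is hypothetical, the strided layout is an assumed example of an access pattern, and the greedy placement is a generic locality-first heuristic, not the scheduler evaluated in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StridedPattern:
    """Hypothetical description of a strided layout (not the MRAP API)."""
    offset: int  # index of the first element of this logical stream
    length: int  # contiguous elements taken at each step
    stride: int  # distance between the starts of consecutive steps


def restructure(data, patterns):
    """Regroup a flat sequence into one contiguous list per named pattern,
    using a single forward pass over the data."""
    out = {name: [] for name in patterns}
    for i, value in enumerate(data):
        for name, p in patterns.items():
            rel = i - p.offset
            if rel >= 0 and rel % p.stride < p.length:
                out[name].append(value)
    return out


def schedule(chunk_locations, slots_per_node):
    """Greedy locality-first placement (an assumed heuristic): run each
    chunk's task on a node holding a replica if that node has a free slot,
    otherwise on any node with a free slot."""
    free = dict(slots_per_node)
    assignment = {}
    for chunk, replicas in chunk_locations.items():
        local = [n for n in replicas if free.get(n, 0) > 0]
        candidates = local or [n for n, s in free.items() if s > 0]
        if not candidates:
            raise RuntimeError("no free task slots left")
        node = max(candidates, key=lambda n: free[n])
        free[node] -= 1
        # Record the chosen node and whether the task is data-local.
        assignment[chunk] = (node, node in replicas)
    return assignment


if __name__ == "__main__":
    # Interleaved records: one "x" value followed by two "y" values.
    layout = {
        "x": StridedPattern(offset=0, length=1, stride=3),
        "y": StridedPattern(offset=1, length=2, stride=3),
    }
    print(restructure(list(range(12)), layout))
    print(schedule({"a": ["n1"], "b": ["n2"], "c": ["n1"]}, {"n1": 1, "n2": 2}))
```

In this sketch the layout description replaces a separate preprocessing job: the regrouping happens while the data is read once, which mirrors the abstract's goal of avoiding repeated scans. The placement function falls back to a non-local node only when no replica holder has a free slot.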
