Article

McTAR: A Multi-Trigger Checkpointing Tactic for Fast Task Recovery in MapReduce

Journal

IEEE TRANSACTIONS ON SERVICES COMPUTING
Volume 14, Issue 6, Pages 1824-1836

Publisher

IEEE COMPUTER SOC
DOI: 10.1109/TSC.2019.2904270

Keywords

Task analysis; Fault tolerance; Fault tolerant systems; Cloud computing; Checkpointing; Big Data; Delays; Checkpoint; failure prediction; fault tolerance; Hadoop MapReduce; task recovery

Funding

  1. National Natural Science Foundation of China [61662051, 61662054]

This paper proposes McTAR, a novel multi-trigger checkpointing approach for fast recovery of MapReduce tasks. The approach speeds up the recovery of MapReduce jobs and significantly reduces the task recovery delay.
Cloud computing and big data technologies have gained great popularity in recent years. MapReduce remains one of the most efficient and widely adopted computing paradigms for providing big data services. MapReduce applications must run on cloud platforms, where failures are inevitable. Hadoop is the de facto implementation of MapReduce, but it provides only coarse-grained and unsatisfactory fault-tolerance services. Failed tasks are rescheduled and re-executed from the very beginning, which adds considerable overhead to failure recovery and heavily delays the whole job when failures occur. In this paper, we propose a novel multi-trigger checkpointing approach for fast recovery of MapReduce tasks, named a Multi-trigger Checkpointing Tactic for fAst TAsk Recovery (McTAR). As a finer-grained fault-tolerance tactic, McTAR combines multi-trigger checkpoint generation, push-pull combined intermediate data distribution, and optimized failure task prediction so that a recovery task attempt can start from a specific progress point determined by the valid checkpoint for its intermediate data. In this way, McTAR can effectively speed up the recovery of MapReduce jobs and significantly reduce the task recovery delay.
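
The core idea of resuming a task attempt from a valid checkpoint, rather than from the very beginning, can be illustrated with a small sketch. This is not the McTAR implementation and does not use any Hadoop API. The abstract does not say which triggers McTAR uses, so the two shown here (a fixed record count and a failure-prediction warning) are assumptions, as are all names such as `Checkpoint`, `run_attempt`, `warn_at`, and `fail_at`. A plain Python list stands in for durable checkpoint storage.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Checkpoint:
    records_done: int        # number of input records already processed
    output: Tuple[int, ...]  # intermediate output produced so far


class TaskFailure(Exception):
    """Raised to simulate a task attempt crashing."""


def run_attempt(
    records: Sequence[int],
    map_fn: Callable[[int], int],
    store: List[Checkpoint],
    record_interval: int = 3,
    warn_at: Optional[int] = None,
    fail_at: Optional[int] = None,
) -> List[int]:
    """Run one task attempt, resuming from the newest checkpoint in `store`."""
    start = store[-1] if store else Checkpoint(0, ())
    output = list(start.output)
    since_checkpoint = 0
    for index in range(start.records_done, len(records)):
        if fail_at is not None and index == fail_at:
            raise TaskFailure(f"simulated crash before record {index}")
        output.append(map_fn(records[index]))
        since_checkpoint += 1
        # Trigger 1 (assumed): a fixed number of records since the last checkpoint.
        periodic = since_checkpoint >= record_interval
        # Trigger 2 (assumed): a failure predictor flags this attempt as at risk.
        predicted = warn_at is not None and index == warn_at
        if periodic or predicted:
            store.append(Checkpoint(index + 1, tuple(output)))
            since_checkpoint = 0
    return output


records = list(range(10))
store: List[Checkpoint] = []

# First attempt: a warning arrives at record 4, then the attempt crashes at record 5.
try:
    run_attempt(records, lambda x: x * x, store, warn_at=4, fail_at=5)
except TaskFailure:
    pass

# The recovery attempt starts from the last valid checkpoint, not from record 0.
resumed_from = store[-1].records_done
result = run_attempt(records, lambda x: x * x, store)
```

In this toy run the warning-triggered checkpoint is what lets the recovery attempt skip all five records processed before the crash. With the periodic trigger alone, it would have resumed from record 3 and redone two of them.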
