4.7 Article

Dynamic and Fault-Tolerant Clustering for Scientific Workflows

Journal

IEEE TRANSACTIONS ON CLOUD COMPUTING
Volume 4, Issue 1, Pages 49-62

Publisher

IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC
DOI: 10.1109/TCC.2015.2427200

Keywords

Scientific workflows; fault tolerance; parameter estimation; failure; machine learning; task clustering; job grouping

Funding

  1. Directorate for Computer and Information Science and Engineering
  2. Office of Advanced Cyberinfrastructure (OAC) [1148515] Funding Source: National Science Foundation

Task clustering has proven to be an effective method to reduce execution overhead and to improve the computational granularity of scientific workflow tasks executing on distributed resources. However, a job composed of multiple tasks may have a higher risk of suffering from failures than a single task job. In this paper, we conduct a theoretical analysis of the impact of transient failures on the runtime performance of scientific workflow executions. We propose a general task failure modeling framework that uses a maximum likelihood estimation-based parameter estimation process to model workflow performance. We further propose three fault-tolerant clustering strategies to improve the runtime performance of workflow executions in faulty execution environments. Experimental results show that failures can have significant impact on executions where task clustering policies are not fault-tolerant, and that our solutions yield makespan improvements in such scenarios. In addition, we propose a dynamic task clustering strategy to optimize the workflow's makespan by dynamically adjusting the clustering granularity when failures arise. A trace-based simulation of five real workflows shows that our dynamic method is able to adapt to unexpected behaviors, and yields better makespans when compared to static methods.
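
The trade-off described in the abstract (larger clustered jobs amortize overhead but are more exposed to failures) can be illustrated with a toy model. The sketch below is not the paper's framework: it assumes exponentially distributed inter-failure times, a fixed per-job overhead, serial execution, and jobs that are retried from scratch at full cost until one attempt sees no failure. All function and parameter names are hypothetical.

```python
import math


def mle_failure_rate(inter_failure_times):
    """Maximum likelihood estimate of an exponential failure rate.

    For exponential samples the MLE is n / sum(samples).
    """
    return len(inter_failure_times) / sum(inter_failure_times)


def expected_makespan(n_tasks, k, task_runtime, overhead, rate):
    """Expected serial makespan when k tasks are merged into each job.

    Toy assumptions (not from the paper): every attempt costs the full
    job length, and a job is retried until an attempt has no failure.
    """
    n_jobs = math.ceil(n_tasks / k)
    job_len = overhead + k * task_runtime
    # P(no failure during one attempt) = exp(-rate * job_len), so the
    # expected number of attempts is its reciprocal.
    expected_attempts = math.exp(rate * job_len)
    return n_jobs * job_len * expected_attempts


def best_cluster_size(n_tasks, task_runtime, overhead, rate):
    """Cluster size k in [1, n_tasks] that minimizes the expected makespan."""
    return min(
        range(1, n_tasks + 1),
        key=lambda k: expected_makespan(n_tasks, k, task_runtime, overhead, rate),
    )


if __name__ == "__main__":
    rate = mle_failure_rate([100.0, 200.0, 300.0])  # observed gaps between failures
    k = best_cluster_size(n_tasks=100, task_runtime=10.0, overhead=50.0, rate=rate)
    print(f"estimated rate={rate:.4f}, suggested cluster size={k}")
```

Under these assumptions, a failure-free environment favors merging all tasks into one job, while a higher estimated failure rate pushes the preferred cluster size down. Re-estimating the rate as failures are observed and recomputing the cluster size is one simple way to picture a dynamic adjustment of clustering granularity.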
