4.4 Article

Resilient parallel computing on volunteer PC grids

Publisher

WILEY
DOI: 10.1002/cpe.4478

Keywords

checkpointing; fault tolerance; host selection; parallel execution; replication; tuplespace; volunteer computing

Funding

  1. National Science Foundation [CNS-0834750, MCB-0919974, CRI-0958464]


Volunteer PC hosts represent massive computation capacity at a low cost but are challenging to employ for general parallel computing. This paper presents the design, execution model, implementation, and evaluation of the Volpex framework for robust execution of parallel codes on volunteer PC grids characterized by system and network heterogeneity, varying availability, and frequent failures. The communication model is based on one-sided Put/Get calls to an abstract global shared space enhanced to support multiple autonomous instances of the same process at different stages of execution. Our approach customizes and combines the use of replication, checkpointing, and host selection. This combination presents formidable challenges that are addressed in this work: efficient checkpointing of distributed replicated processes, dynamic management of redundancy, quick restart in a distributed environment, and application-specific host selection. The integrated runtime system is shown to effectively execute moderate-size, coarse-grain, communicating codes on a worldwide distributed volunteer environment, a new milestone in volunteer computing. Extensive evaluation is conducted with example scientific codes on a pool of around 600 volunteer hosts. The results demonstrate the trade-offs in deploying checkpointing, redundancy, and host selection, and how these methods combine to provide application performance that is close to the ideal failure-free performance.
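
The abstract describes one-sided Put/Get calls to a shared space that tolerates several replicas of the same process running at different stages. The toy below illustrates one way such a space could behave; it is not the Volpex API. The class name `ToySharedSpace`, the `process_id` and `call_no` parameters, and the policy of ignoring duplicate Puts and logging Get results are all assumptions made for this sketch.

```python
class ToySharedSpace:
    """In-memory stand-in for a shared space with replica-aware Put/Get.

    Assumption: every replica of a logical process issues the same
    numbered sequence of calls, so (process_id, call_no) names one
    logical operation regardless of which replica performs it.
    """

    def __init__(self):
        self._data = {}          # key -> most recently stored value
        self._seen_puts = set()  # logical Puts that were already applied
        self._get_log = {}       # logical Get -> value first returned

    def put(self, process_id, call_no, key, value):
        """Apply a logical Put once; return True if it took effect."""
        tag = (process_id, call_no)
        if tag in self._seen_puts:
            return False         # a faster replica already did this Put
        self._seen_puts.add(tag)
        self._data[key] = value
        return True

    def get(self, process_id, call_no, key):
        """Return the value for a logical Get, identical for all replicas.

        The first replica to issue the Get reads the current value, which
        is logged; slower replicas replaying the same Get see the logged
        value. A real system would block on a missing key; this toy
        raises KeyError instead.
        """
        tag = (process_id, call_no)
        if tag not in self._get_log:
            self._get_log[tag] = self._data[key]
        return self._get_log[tag]


space = ToySharedSpace()
space.put("p0", 1, "x", 10)          # fast replica of p0 writes x
first = space.get("p1", 1, "x")      # fast replica of p1 reads x
space.put("p0", 2, "x", 20)          # fast replica of p0 moves ahead
stale = space.put("p0", 1, "x", 10)  # slow replica of p0 replays Put 1
replay = space.get("p1", 1, "x")     # slow replica of p1 replays Get 1
latest = space.get("p1", 2, "x")     # next logical Get sees the new value
```

In this sketch the replayed Put is dropped and the replayed Get returns the logged value, so a lagging replica neither overwrites newer data nor diverges from the replica ahead of it.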
