Article

Resilient parallel computing on volunteer PC grids

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE (2018)

Journal

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE

Volume 30, Issue 18, Pages -

Publisher

WILEY

DOI: 10.1002/cpe.4478

Keywords

checkpointing; fault tolerance; host selection; parallel execution; replication; tuplespace; volunteer computing

Categories

Computer Science, Software Engineering; Computer Science, Theory & Methods

Funding

National Science Foundation [CNS-0834750, MCB-0919974, CRI-0958464]

Abstract

Volunteer PC hosts represent massive computation capacity at a low cost but are challenging to employ for general parallel computing. This paper presents the design, execution model, implementation, and evaluation of the Volpex framework for robust execution of parallel codes on volunteer PC grids characterized by system and network heterogeneity, varying availability, and frequent failures. The communication model is based on one-sided Put/Get calls to an abstract global shared space, enhanced to support multiple autonomous instances of the same process at different stages of execution. Our approach customizes and combines the use of replication, checkpointing, and host selection. This presents formidable challenges that are addressed in this work: efficient checkpointing of distributed replicated processes, dynamic management of redundancy, quick restart in a distributed environment, and application-specific host selection. The integrated runtime system is shown to effectively execute moderate-size, coarse-grain, communicating codes on a worldwide distributed volunteer environment, a new milestone in volunteer computing. Extensive evaluation is conducted with example scientific codes on a pool of around 600 volunteer hosts. The results demonstrate the trade-offs in deploying checkpointing, redundancy, and host selection, and how these methods combine to provide application performance that is close to the ideal failure-free performance.
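
The abstract does not spell out how a Put/Get space can serve several replicas of one process that run at different speeds. The toy sketch below shows one plausible behaviour, offered purely as an illustration: each call is tagged with a process id and a per-process request number, a repeated put from a lagging replica is ignored, and a repeated get returns the value logged for the first replica. The class name, method signatures, and tagging scheme are all assumptions made for this sketch and are not taken from the Volpex API.

```python
from typing import Any, Dict, Set, Tuple


class ReplicaAwareSpace:
    """Toy in-memory shared space (hypothetical; not the Volpex API)."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        # (process id, request number) -> value first returned for that get
        self._get_log: Dict[Tuple[int, int], Any] = {}
        # (process id, request number) pairs whose put was already applied
        self._put_seen: Set[Tuple[int, int]] = set()

    def put(self, proc: int, req: int, key: str, value: Any) -> bool:
        """Apply a put once per logical (proc, req); return True if applied."""
        if (proc, req) in self._put_seen:
            return False  # a slower replica repeating an earlier put
        self._put_seen.add((proc, req))
        self._values[key] = value
        return True

    def get(self, proc: int, req: int, key: str) -> Any:
        """Return the same value to every replica issuing this logical get."""
        tag = (proc, req)
        if tag not in self._get_log:
            # First replica to reach this get: log what it observes.
            self._get_log[tag] = self._values[key]
        return self._get_log[tag]
```

In this sketch a slow replica of process 1 that issues its first get after the key has been overwritten still receives the value its faster sibling saw, so both replicas follow the same execution path. A real system would also need blocking gets, log pruning, and network transport, none of which are modelled here.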
