Article

Resilient computational applications using Coarray Fortran

Journal

PARALLEL COMPUTING
Volume 81, Pages 58-67

Publisher

ELSEVIER SCIENCE BV
DOI: 10.1016/j.parco.2018.12.002

Keywords

Partitioned global address space; Coarray Fortran; Fault tolerance

Funding

  1. Directorate for Computer and Information Science and Engineering
  2. Division of Computing and Communication Foundations [1533850] (funding source: National Science Foundation)

As the number of hardware components and software-stack layers in High Performance Computing (HPC) systems grows, the number of user-visible hardware and software failures is likely to increase. Even under the most optimistic assumptions about the reliability of individual components, probabilistic amplification from using millions of nodes has a dramatic impact on the Mean Time Between Failures (MTBF) of the entire platform. Although several techniques have been developed to address this problem, programming models still give users insufficient support for mitigating or working around it. The Fortran 2018 standard defines failed images, a new feature that allows the programmer to detect and manage image failures in a parallel program. In this paper we show how to use failed images and teams, another feature defined in the Fortran 2018 standard, to implement resilient computational applications. © 2018 Elsevier B.V. All rights reserved.
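
The probabilistic amplification described in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes that nodes fail independently, that time to failure is exponentially distributed, and that the platform fails as soon as any single node fails, so the per-node failure rates simply add. The function name `system_mtbf_hours` and the 10-year per-node MTBF are hypothetical values chosen for illustration.

```python
def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Estimate the MTBF of a platform that fails when any one node fails.

    Assumption (not from the paper): nodes fail independently with
    exponentially distributed time to failure, so failure rates add and
    the platform MTBF is the per-node MTBF divided by the node count.
    """
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mtbf_hours and node_count must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    HOURS_PER_YEAR = 24 * 365
    # Hypothetical figure: each node fails on average once every 10 years.
    node_mtbf = 10 * HOURS_PER_YEAR
    for nodes in (1_000, 100_000, 1_000_000):
        minutes = system_mtbf_hours(node_mtbf, nodes) * 60
        print(f"{nodes:>9} nodes: platform MTBF ~ {minutes:.1f} minutes")
```

Under these simplifying assumptions, a platform of one million such nodes would see a failure roughly every five minutes, which is why the abstract argues that failures need to be detected and handled from within the programming model.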
